Build Log

How we build the harness behind the practice. Real work, shipped to production. The log is evidence for clients and a public trail for senior operators considering Gauges Green.

The Gauges Green Harness Series

A running account of the operating harness behind the practice. Built for real client work, then extracted into public-safe pieces operators can study and eventually use.

The agents, models, and vendors named in older posts are implementation details. The recurring story is the harness around the work: context, acceptance, gates, review, evidence, and operator judgment.

Read it in three ways: clients should look for how context, scorecards, and quality gates change the work; operators should look for what they would no longer have to rebuild; builders should look for patterns worth forking.


Start Here

Clients

Start with the Boot Card and scorecard. Those pages show how the harness changes operating rhythm and practice quality.


Senior operators

Start with compiled context and the operator page. Those pages show the burden the harness removes.


Builders

Start with the harness overview, Critique, and the public GitHub demo. Those pages show patterns that can be inspected without private client data.



July 2026

Jul 9 AI Infrastructure Replaced a 'never batch a risky send' rule that lived only as a design guideline with a hardcoded exclusion checked first in the code that decides what gets grouped together. A safety rule that exists as policy instead of as code is exactly the gap a shared verification layer elsewhere in the system already exists to close.
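A minimal sketch of the "exclusion checked first" idea from the Jul 9 entry. The log does not show the real grouping code, so the `Send` type, the `risky` flag, and grouping by recipient are all assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Send:
    recipient: str
    body: str
    risky: bool = False  # hypothetical flag; the log does not say how risk is marked


def group_sends(sends):
    """Group sends by recipient, but never place a risky send in a shared batch."""
    singles = []
    batches = {}
    for send in sends:
        # Hard exclusion: checked before any grouping logic runs.
        if send.risky:
            singles.append([send])
            continue
        batches.setdefault(send.recipient, []).append(send)
    return singles + list(batches.values())
```

Because the check sits at the top of the loop, no later grouping rule can pull a risky send into a batch.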
Jul 8 Client Work Fixed a bug where any meeting task with no listed owner was silently assumed to be Ryan's, including items that clearly belonged to someone else based on what they said. A safe-sounding default (nothing gets lost) turned out to be worse than the failure it was designed to prevent, once the real risk was misattribution rather than omission.
Jul 8 AI Infrastructure Turned on the internal ops app's own dark mode, then ran a real contrast scan against it instead of trusting how it looked. Found two accessibility failures a static review would have missed, including a status indicator that measured well under the required contrast ratio. Shipping a feature and shipping an accessible feature turned out to be two different claims, and only the live scan proved the second one.
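The contrast measurement behind the Jul 8 dark-mode entry can be sketched with the WCAG 2.x contrast-ratio formula. The entry does not name the scanner, the colors, or the threshold applied, so the default 4.5:1 minimum (WCAG AA for normal text) and the function names are assumptions; non-text UI components such as status indicators are usually held to 3:1 instead.

```python
def _linearize(channel_8bit: int) -> float:
    """Convert one sRGB channel (0-255) to linear light, per WCAG 2.x."""
    c = channel_8bit / 255.0
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(hex_color: str) -> float:
    """Relative luminance of a '#rrggbb' color."""
    h = hex_color.lstrip("#")
    r, g, b = (int(h[i:i + 2], 16) for i in (0, 2, 4))
    return 0.2126 * _linearize(r) + 0.7152 * _linearize(g) + 0.0722 * _linearize(b)


def contrast_ratio(fg: str, bg: str) -> float:
    """Contrast ratio between two colors, from 1.0 (identical) up to 21.0."""
    lum_a, lum_b = relative_luminance(fg), relative_luminance(bg)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def passes(fg: str, bg: str, minimum: float = 4.5) -> bool:
    """True if the pair meets the given minimum ratio (assumed threshold)."""
    return contrast_ratio(fg, bg) >= minimum
```

A mid-gray such as `#777777` on white lands just under 4.5:1, which is the kind of near miss that looks fine to the eye and still fails the scan.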

May 2026

May 21 AI Infrastructure Why Chat Models Keep Reframing Everything
A sourced explanation for the negation-then-reframe sentence that keeps showing up in AI-generated prose.
May 19 AI Infrastructure The First Screen Is a Control Surface
The Boot Card turns agent startup into an operator readout: mission, goal progress, and top priorities visible before the session starts.
May 18 Quality Kept the verification gate in shadow after the v9 bake review. Zero post-May-11 Stop-hook rows is not clean evidence. High-volume blockers need labeled shadow data or replay, not silence.
May 18 AI Infrastructure Closed a direct-send safety gap by requiring approval envelopes at the WhatsApp send-script boundary. Hook-only safety is not enough when multiple runtimes can reach the same sender.
May 18 Harness Restored routed-memory freshness by giving the context router its own 5-minute scheduler. A live inbox is not enough if the routed memory layer is stale.
May 15 Harness Turned Proof Scout into an active detector for marketing proof moments. It scans recent harness work and prompts for the screenshot, redaction, filename, and likely use while the evidence is still fresh.
May 13 AI Infrastructure Accepted the Codex Telegram runtime by explicit override while keeping the readiness basis visible. A human override is valid only if the telemetry says it was an override, not a clean bake.
May 12 Client Work Shipped an active source-fidelity gate for public and client-adjacent claims. Unsupported claims now need a source, a hypothesis label, or a validation-question frame.
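The rule in the May 12 entry (a claim needs a source, a hypothesis label, or a validation-question frame) can be sketched as a simple check. The `Claim` fields and function names are hypothetical; the real gate's rules are not shown in the log.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Claim:
    text: str
    source: Optional[str] = None          # e.g. a file or transcript reference
    is_hypothesis: bool = False           # explicitly labeled as a hypothesis
    is_validation_question: bool = False  # framed as a question to validate


def unsupported(claims):
    """Return the claims that carry none of the three accepted supports."""
    return [
        c for c in claims
        if not (c.source or c.is_hypothesis or c.is_validation_question)
    ]


def gate(claims):
    """Raise if any claim is unsupported; otherwise return the claims unchanged."""
    blocked = unsupported(claims)
    if blocked:
        raise ValueError(f"{len(blocked)} unsupported claim(s): {[c.text for c in blocked]}")
    return claims
```

Raising instead of warning is the point of the sketch: an unsupported claim stops the pipeline and cannot be waved through.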
May 11 AI Infrastructure The Gauges Green PDLC

April 2026

Apr 26 Harness Collapsed the system name into the company. Gauges Green is now both the practice and the harness. Single brand across both. Andale becomes the boot injector subsystem (replacing Arriba). All other subsystem names unchanged.
Apr 25 Harness The Agent Was Fine. The Handshake Was Broken.
Three weeks of an agent appearing dormant turned out to be a 30-second pairing handshake. Distribution failures look identical to production failures from the consumer's perspective.
Apr 12 Client Work Built Pluma, a document pipeline that turns Markdown into branded Google Docs. Same source file produces the website version and the client-facing version. Sharing is a separate, gated action.
Apr 12 Client Work An agent fabricated quotes in a $3K/mo client-facing synthesis doc. Built a source fidelity gate: 7 new rules that mechanically block unverified claims from reaching clients. The model is not the moat. The gate is.
Apr 12 Client Work Shipped the Jamie meeting transcription pipeline end to end. Webhook receiver with HMAC verification, per-speaker attribution, action items extracted and routed to the right agent automatically.
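A minimal sketch of HMAC verification for a webhook receiver like the one in the Apr 12 entry. The log does not describe Jamie's actual signing scheme, so the SHA-256 digest, the hex-encoded signature, and the function name are assumptions.

```python
import hashlib
import hmac


def verify_signature(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Check that signature_hex is the HMAC-SHA256 of the raw request body."""
    expected = hmac.new(secret, body, hashlib.sha256).hexdigest()
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature matched.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The check has to run against the raw bytes of the request body, before any JSON parsing, because re-serializing the payload can change the bytes and break the signature.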
Apr 12 Client Work Named the harness parts. Andale opens agent sessions. Pulse watches inbound channels. Dispatch manages tasks. Scribe captures meetings. Pluma renders deliverables. Critique checks quality. Snitch audits outbound sends.
Apr 11 AI Infrastructure Killed a potential Zoom AI Companion migration in an hour by verifying four independent kill criteria before writing a single line of code. The 30-second search that saves the 90-minute build.
Apr 11 Client Work Ran a full market eval of bot-free meeting transcription tools in a single session. Reversed the verdict three times as new constraints surfaced. Ended on Jamie after ruling out Fireflies, Otter.ai (active wiretapping lawsuit), Granola (2-stream limit), and six others.
Apr 11 Harness Migrated all 7 agents from Slack to Telegram in a single afternoon. One new transport module, five script patches. The agents kept working through the transition.
Apr 10 Client Work Shipped Client Mode Phase 1. Agent recommendations are now tier-scoped per client. A $1K/mo engagement gets one next step per session. A $15K/mo engagement gets five. The tier rules are injected at boot, not left to judgment.
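The tier rule in the Apr 10 entry can be sketched as a lookup applied before recommendations reach the operator. Only the two tiers named in the entry are encoded; the table name, the function name, and the behavior for unknown tiers are assumptions.

```python
# Only the two tiers named in the log entry; any other tier is unknown here.
MAX_NEXT_STEPS = {1_000: 1, 15_000: 5}


def scope_recommendations(monthly_fee_usd: int, ranked_steps: list) -> list:
    """Trim a ranked list of next steps to the limit for the client's tier."""
    if monthly_fee_usd not in MAX_NEXT_STEPS:
        # Fail loudly instead of guessing a limit for an unlisted tier.
        raise ValueError(f"no tier rule for ${monthly_fee_usd}/mo")
    return ranked_steps[:MAX_NEXT_STEPS[monthly_fee_usd]]
```

Keeping the limit in a table that is loaded at boot means the agent never decides the tier itself, which matches the entry's "not left to judgment".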
Apr 10 Client Work Pressure-tested a client's new monetization thesis against their own CRM data on day one of the engagement. The compiled context surfaced three inconsistencies between what the team believed and what their pipeline showed.
Apr 9 Client Work Type a client codename and a dedicated Opus session opens with that client's entire vault: goals, product roadmap, meeting transcripts dating back to the first call, every email thread. Nothing else loaded. No cross-client bleed.
Apr 9 Client Work Added session resume. Close a client session, reopen it the next day, and the agent picks up exactly where you left off with a lightweight delta of what changed overnight.
Apr 9 AI Infrastructure CamTune (now Ojo) optimizes against what video call participants actually see, not the raw camera feed. Switched from camera capture to window capture. Your Zoom tile is the optimization target.
Apr 8 Harness Built a knowledge wiki using the Karpathy pattern. LLM-compiled reference articles from 638 Lenny Rachitsky episodes and 156 Marketing Examples posts. Agents cite compiled reference material from the wiki instead of hallucinating advice.
Apr 8 Quality Shipped 611 tests across the harness. Started at zero in late March. The suite catches contract mismatches between pipeline components before they reach production.
Apr 7 Client Work Built implicit performance tracking. A plugin detects approvals and corrections from natural conversation without any special syntax. 5,200 scored interactions across 7 agents. The data showed 56% approval in general chat, 98% in dedicated client sessions.
Apr 7 Client Work Upgraded all 7 agents from Sonnet to Opus for interactive sessions. 192 corrections traced to model capability, not bad instructions. Heartbeats stay on a cheap model. Three-tier routing: Opus for client work, Sonnet for background tasks, DeepSeek for heartbeats.
Apr 7 Harness Pulse was feeding every agent all 323 contacts and 24 cross-domain projects. The marketing agent got fitness protocols. The Spanish tutor got deal pipelines. Added domain filtering so each agent sees only what belongs to its world.
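The domain filtering described in the Apr 7 Pulse entry can be sketched as an allow-list per agent. The `domain` key and the function name are hypothetical; the log does not show how contacts and projects are tagged.

```python
def filter_for_agent(agent_domains, items):
    """Keep only the items whose 'domain' is in the agent's allowed set."""
    allowed = set(agent_domains)
    # Items with no domain tag are dropped, not shared with everyone.
    return [item for item in items if item.get("domain") in allowed]
```

Dropping untagged items is a design choice in this sketch: it errs toward an agent seeing too little over seeing another agent's world.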
Apr 6 Quality Distilled 192 agent corrections into 31 behavioral rules and deployed them as mechanical enforcement. A gateway plugin blocks unauthorized external sends before they leave the system. Rules in a prompt are suggestions. This is a safeguard.
Apr 6 Harness Diagnosed why agents kept asking me to explain my own life every 48 hours. Working memory had no persistent layer. Added durable knowledge sections that survive automatic context refreshes.
Apr 6 Harness Every agent now knows which topics belong to which peer and routes accordingly. Previously one of seven had explicit domain boundaries. The chief of staff was texting me about fitness supplements at 5 AM.
Apr 6 Harness Added a governance rule for runtime upgrades: package changes require a dedicated session with rollback. The harness can move quickly without letting a routine upgrade take down the workday.
Apr 6 Quality Built gateway cost tracking after the dashboard showed $146/day but only $3-5 was actually billed. The rest was covered by a flat-rate plan. Without separating billing tiers, every cost report was fiction.
Apr 5 Harness Built tab notifications for the terminal. When an agent finishes thinking, the tab lights up with a bell icon and a colored border. Small thing. Eliminated 30 minutes of daily context-switch overhead.
Apr 5 Harness Launched a family WhatsApp group with the chief-of-staff agent as coordinator. It recommends restaurants with walk-time estimates and handles scheduling. My wife uses it more than I do.
Apr 5 Quality One command shows full system health across 7 dimensions. Pipeline freshness, credential TTLs, agent payload staleness, delivery timeouts, contract test results. Replaced inspecting 17 scattered JSON files.
Apr 3 Harness Built a shared context layer so all 7 agents know about each other. Two tiers: universal awareness for everyone, business context only for agents that need it. No cross-domain bleed.
Apr 3 Quality Published the first engineering scorecard across 8 pillars. Architecture was already strong. Observability and engineering quality became the load-bearing improvement areas.
Apr 2 Harness Type "piper" in any terminal and a session opens with that agent's full compiled context. Identity, architecture, domain knowledge, live task state. Zero file reads. The CLI is the interface.
Apr 2 Harness Swapped 7 agent heartbeats from a 671B-parameter model to a 3B-parameter model. Same job, 85% cost reduction. Low-reasoning tasks run 100x more often than deep work, so model choice there dominates cost.

March 2026

Mar 30 Quality Built two-stage CI/CD. Local commit hook under 15 seconds, remote GitHub Actions for the full suite. Path filtering so a docs change doesn't trigger a build test.
Mar 28 Quality Took the codebase from D- to A-. 213 bare exception handlers eliminated. 876 print() calls converted to structured logging. 200 unit tests from zero. 14 shared libraries extracted from 5 monolithic scripts.
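The two cleanups named in the Mar 28 entry (bare exception handlers and print() calls) follow a common pattern, sketched here with the standard `logging` module. The logger name, function name, and log fields are hypothetical and not taken from the codebase.

```python
import json
import logging

logger = logging.getLogger("harness.example")  # hypothetical logger name


def load_payload(raw: str):
    """Parse a JSON payload; log a structured record and return None on failure."""
    try:
        return json.loads(raw)
    except json.JSONDecodeError as exc:  # a named exception, not a bare `except:`
        # A constant event name plus fields in `extra`, instead of print().
        logger.warning(
            "payload_parse_failed",
            extra={"error": str(exc), "size": len(raw)},
        )
        return None
```

Catching only `json.JSONDecodeError` lets unrelated failures, such as a `TypeError` from a wrong argument, surface instead of being swallowed.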
Mar 15 Harness Shipped the intent engine. 31 detectors scan Gmail, Calendar, iMessage, WhatsApp, and task state every 5 minutes. Each agent gets a ranked priority queue of what needs attention right now. The system's nose for what matters.
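The detector-to-queue shape in the Mar 15 entry can be sketched as a set of small functions that each emit scored signals, merged into one ranked list. The two detectors, the `score` field, and the state layout are hypothetical; the real 31 detectors are not described in the log.

```python
def overdue_tasks(state):
    """Hypothetical detector: every overdue task is a high-priority signal."""
    return [{"what": task, "score": 0.9} for task in state.get("overdue", [])]


def unread_messages(state):
    """Hypothetical detector: unread messages are lower-priority signals."""
    return [{"what": msg, "score": 0.4} for msg in state.get("unread", [])]


def run_detectors(detectors, state):
    """Run every detector against the state and return signals, highest score first."""
    signals = []
    for detect in detectors:
        signals.extend(detect(state))
    return sorted(signals, key=lambda s: s["score"], reverse=True)
```

Each detector stays independent of the others, so adding a new one is a single function plus an entry in the list passed to `run_detectors`.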
Mar 1 AI Infrastructure Wrote one 400-line SOP.md that gives Claude permission to deploy autonomously. Features go from code to production in a single conversation. No staging environment, no manual deploys.

February 2026

Feb 24 Harness Built Flare after a single session burned 430M tokens in 90 minutes from an auto-compact loop. Three-layer monitoring: a Claude Code hook tracks usage in real time, a daemon projects pace against the billing window, and the agent sees the projection at boot.
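The "projects pace against the billing window" step in the Feb 24 entry can be sketched as a linear extrapolation. The log does not say how Flare projects usage or how long the billing window is, so the linear model, the function names, and every parameter are assumptions.

```python
def project_usage(tokens_used: float, elapsed_s: float, window_s: float) -> float:
    """Extrapolate current token usage linearly to the end of the window."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return tokens_used * (window_s / elapsed_s)


def will_exceed(tokens_used: float, elapsed_s: float, window_s: float, budget: float) -> bool:
    """True if the projected end-of-window usage is above the budget."""
    return project_usage(tokens_used, elapsed_s, window_s) > budget
```

A projection like this is what lets the agent see trouble at boot: a burn rate that looks small in absolute terms can still project past the budget well before the window closes.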
Feb 19 Harness Built the boot compiler that gives each agent source-backed operating context. 350 lines of Python, runs every 5 minutes, produces per-agent payloads up to 306KB. Seven agents boot with their domain state already assembled.
Feb 19 Harness Built Pulse, the unified inbox. iMessage, WhatsApp, Apple Notes, Slack, Gmail, Granola normalized into one JSONL stream. 37 iMessages and 219 Apple Notes flowed in on the first run. Every agent reads one file.
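The "normalized into one JSONL stream" step in the Pulse entry can be sketched as mapping every channel's message into one record shape and writing one JSON object per line. The field names and function names are hypothetical; the real schema is not shown in the log.

```python
import json


def normalize(source: str, sender: str, text: str, timestamp: str) -> dict:
    """Map a message from any channel into one shared record shape."""
    return {"source": source, "sender": sender, "text": text, "ts": timestamp}


def to_jsonl(records) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)
```

One record per line means a reader can tail the file and parse each line independently, which is what makes "every agent reads one file" practical.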

January 2026

Jan 20 Quality Built Critique, a two-layer design quality gate. Code analysis checks source files for wrong patterns. Playwright captures screenshots in four variants and Claude vision judges the pixels. The design system document is the rule engine.