How we build the harness behind the practice. Real work,
shipped to production. The log is evidence for clients and a public
trail for senior operators considering Gauges Green.
The Gauges Green Harness Series
A running account of the operating harness behind the practice.
Built for real client work, then extracted into public-safe
pieces operators can study and eventually use.
The agents, models, and vendors named in older posts are
implementation details. The recurring story is the harness around
the work: context, acceptance, gates, review, evidence, and
operator judgment.
Read it in three ways: clients should look for how context,
scorecards, and quality gates change the work; operators should
look for what they would no longer have to rebuild; builders
should look for patterns worth forking.
The Boot Card turns agent startup into an operator readout: mission, goal progress, and top priorities visible before the session starts.
May 18 · Quality · Kept the verification gate in shadow after the v9 bake review. Zero post-May-11 Stop-hook rows is not clean evidence. High-volume blockers need labeled shadow data or replay, not silence.
May 18 · AI Infrastructure · Closed a direct-send safety gap by requiring approval envelopes at the WhatsApp send-script boundary. Hook-only safety is not enough when multiple runtimes can reach the same sender.
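A minimal sketch of what enforcing approval at the sender boundary could look like, as opposed to relying on a hook in one runtime. The `ApprovalEnvelope` fields, the `send_whatsapp` function, and the matching rule are all hypothetical; the log does not describe the real envelope format.

```python
from dataclasses import dataclass
from typing import Optional


class ApprovalRequired(Exception):
    """Raised when a send is attempted without a valid approval envelope."""


@dataclass(frozen=True)
class ApprovalEnvelope:
    # Hypothetical fields; the real envelope format is not described in the log.
    approved_by: str
    recipient: str
    message: str


def send_whatsapp(recipient: str, message: str,
                  envelope: Optional[ApprovalEnvelope]) -> str:
    """Check approval inside the sender, so every runtime hits the same gate."""
    if envelope is None:
        raise ApprovalRequired("no approval envelope supplied")
    if (envelope.recipient, envelope.message) != (recipient, message):
        raise ApprovalRequired("envelope does not match this recipient and message")
    # A real sender would call the transport here; this sketch only reports the decision.
    return f"sent to {recipient}"
```

Because the check lives in the function every caller must go through, a runtime that bypasses its own hooks still cannot send without an envelope that matches the exact recipient and message.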
May 18 · Harness · Restored routed-memory freshness by giving the context router its own 5-minute scheduler. A live inbox is not enough if the routed memory layer is stale.
May 15 · Harness · Turned Proof Scout into an active detector for marketing proof moments. It scans recent harness work and prompts for the screenshot, redaction, filename, and likely use while the evidence is still fresh.
May 13 · AI Infrastructure · Accepted the Codex Telegram runtime by explicit override while keeping the readiness basis visible. A human override is valid only if the telemetry says it was an override, not a clean bake.
May 12 · Client Work · Shipped an active source-fidelity gate for public and client-adjacent claims. Unsupported claims now need a source, a hypothesis label, or a validation-question frame.
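The rule in this entry (a claim needs a source, a hypothesis label, or a validation-question frame) can be sketched as a simple check. The `Claim` structure and function names below are illustrative assumptions, not the shipped gate.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Claim:
    # Hypothetical structure for illustration only.
    text: str
    source: Optional[str] = None          # citation or link backing the claim
    is_hypothesis: bool = False           # explicitly labeled as a hypothesis
    is_validation_question: bool = False  # framed as a question to validate


def unsupported_claims(claims: List[Claim]) -> List[Claim]:
    """Return the claims that carry none of the three accepted supports."""
    return [
        c for c in claims
        if not (c.source or c.is_hypothesis or c.is_validation_question)
    ]


def gate_passes(claims: List[Claim]) -> bool:
    """Pass only when every claim is sourced, labeled, or framed as a question."""
    return not unsupported_claims(claims)
```

Returning the offending claims, rather than only a boolean, lets the caller show exactly which sentences need a source or a label before the document can go out.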
May 11 · AI Infrastructure · The Gauges Green PDLC
April 2026
Apr 26 · Harness · Collapsed the system name into the company. Gauges Green is now both the practice and the harness. Single brand across both. Andale becomes the boot injector subsystem (replacing Arriba). All other subsystem names unchanged.
Three weeks of an agent appearing dormant turned out to be a 30-second pairing handshake. Distribution failures look identical to production failures from the consumer's perspective.
Apr 12 · Client Work · Built Pluma, a document pipeline that turns Markdown into branded Google Docs. Same source file produces the website version and the client-facing version. Sharing is a separate, gated action.
Apr 12 · Client Work · An agent fabricated quotes in a $3K/mo client-facing synthesis doc. Built a source fidelity gate: 7 new rules that mechanically block unverified claims from reaching clients. The model is not the moat. The gate is.
Apr 12 · Client Work · Shipped the Jamie meeting transcription pipeline end to end. Webhook receiver with HMAC verification, per-speaker attribution, action items extracted and routed to the right agent automatically.
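For readers unfamiliar with HMAC verification on a webhook receiver, here is a minimal sketch using Python's standard library. It assumes a shared secret, SHA-256, and a hex-encoded signature; the scheme the Jamie webhook actually uses is not described in this log, and the function names are hypothetical.

```python
import hashlib
import hmac


def sign(secret: bytes, body: bytes) -> str:
    """Compute a hex HMAC-SHA256 signature over the raw request body."""
    return hmac.new(secret, body, hashlib.sha256).hexdigest()


def verify_webhook(secret: bytes, body: bytes, signature_hex: str) -> bool:
    """Accept the request only if the supplied signature matches the body."""
    expected = sign(secret, body)
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of the signature were correct.
    return hmac.compare_digest(expected.encode(), signature_hex.encode())
```

The signature must be computed over the raw bytes as received, before any JSON parsing, because re-serializing the payload can change whitespace or key order and invalidate the match.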
Apr 11 · AI Infrastructure · Killed a potential Zoom AI Companion migration in an hour by verifying four independent kill criteria before writing a single line of code. The 30-second search that saves the 90-minute build.
Apr 11 · Client Work · Ran a full market eval of bot-free meeting transcription tools in a single session. Reversed the verdict three times as new constraints surfaced. Ended on Jamie after ruling out Fireflies, Otter.ai (active wiretapping lawsuit), Granola (2-stream limit), and six others.
Apr 11 · Harness · Migrated all 7 agents from Slack to Telegram in a single afternoon. One new transport module, five script patches. The agents kept working through the transition.
Apr 10 · Client Work · Shipped Client Mode Phase 1. Agent recommendations are now tier-scoped per client. A $1K/mo engagement gets one next step per session. A $15K/mo engagement gets five. The tier rules are injected at boot, not left to judgment.
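A rough sketch of tier-scoped recommendations, using only the two tiers named in the entry. Treating the tiers as minimum-fee thresholds, returning nothing below the lowest tier, and the names `TIER_LIMITS`, `max_next_steps`, and `scope_recommendations` are all assumptions made for illustration.

```python
from typing import List

# (minimum monthly fee in USD, next steps allowed per session).
# Only the two tiers named in the entry; any tiers in between are not specified.
TIER_LIMITS = [(15_000, 5), (1_000, 1)]


def max_next_steps(monthly_fee_usd: int) -> int:
    """Return the step limit for the highest tier the fee qualifies for."""
    for minimum, limit in TIER_LIMITS:
        if monthly_fee_usd >= minimum:
            return limit
    return 0  # assumption: below the lowest known tier, recommend nothing


def scope_recommendations(ranked: List[str], monthly_fee_usd: int) -> List[str]:
    """Trim an already-ranked list of next steps to the tier's limit."""
    return ranked[: max_next_steps(monthly_fee_usd)]
```

Keeping the limit in data rather than in prompt wording is what makes the rule mechanical: the agent never sees more candidate steps than the tier allows.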
Apr 10 · Client Work · Pressure-tested a client's new monetization thesis against their own CRM data on day one of the engagement. The compiled context surfaced three inconsistencies between what the team believed and what their pipeline showed.
Apr 9 · Client Work · Type a client codename and a dedicated Opus session opens with that client's entire vault: goals, product roadmap, meeting transcripts dating back to the first call, every email thread. Nothing else loaded. No cross-client bleed.
Apr 9 · Client Work · Added session resume. Close a client session, reopen it the next day, and the agent picks up exactly where you left off with a lightweight delta of what changed overnight.
Apr 9 · AI Infrastructure · CamTune (now Ojo) optimizes against what video call participants actually see, not the raw camera feed. Switched from camera capture to window capture. Your Zoom tile is the optimization target.
Apr 8 · Harness · Built a knowledge wiki using the Karpathy pattern. LLM-compiled reference articles from 638 Lenny Rachitsky episodes and 156 Marketing Examples posts. Agents cite compiled reference material from the wiki instead of hallucinating advice.
Apr 8 · Quality · Shipped 611 tests across the harness. Started at zero in late March. The suite catches contract mismatches between pipeline components before they reach production.
Apr 7 · Client Work · Built implicit performance tracking. A plugin detects approvals and corrections from natural conversation without any special syntax. 5,200 scored interactions across 7 agents. The data showed 56% approval in general chat, 98% in dedicated client sessions.
Apr 7 · Client Work · Upgraded all 7 agents from Sonnet to Opus for interactive sessions. 192 corrections traced to model capability, not bad instructions. Heartbeats stay on a cheap model. Three-tier routing: Opus for client work, Sonnet for background tasks, DeepSeek for breathing.
Apr 7 · Harness · Pulse was feeding every agent all 323 contacts and 24 cross-domain projects. The marketing agent got fitness protocols. The Spanish tutor got deal pipelines. Added domain filtering so each agent sees only what belongs to its world.
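Domain filtering of this kind can be sketched as an allow-list applied before context is handed to an agent. The `domain` field on each item and the `filter_for_agent` name are assumptions; the log does not show how Pulse tags its records.

```python
from typing import Dict, List, Set


def filter_for_agent(items: List[Dict[str, str]],
                     allowed_domains: Set[str]) -> List[Dict[str, str]]:
    """Keep only items whose domain is on the agent's allow-list.

    Items with no domain tag are dropped, so untagged data cannot
    leak into an agent's context by default.
    """
    return [item for item in items if item.get("domain") in allowed_domains]
```

Dropping untagged items is a deliberate choice in this sketch: a deny-by-default filter fails quietly toward less context rather than toward cross-domain bleed.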
Apr 6 · Quality · Distilled 192 agent corrections into 31 behavioral rules and deployed them as mechanical enforcement. A gateway plugin blocks unauthorized external sends before they leave the system. Rules in a prompt are suggestions. This is a safeguard.
Apr 6 · Harness · Diagnosed why agents kept asking me to explain my own life every 48 hours. Working memory had no persistent layer. Added durable knowledge sections that survive automatic context refreshes.
Apr 6 · Harness · Every agent now knows which topics belong to which peer and routes accordingly. Previously one of seven had explicit domain boundaries. The chief of staff was texting me about fitness supplements at 5 AM.
Apr 6 · Harness · Added a governance rule for runtime upgrades: package changes require a dedicated session with rollback. The harness can move quickly without letting a routine upgrade take down the workday.
Apr 6 · Quality · Built gateway cost tracking after the dashboard showed $146/day but only $3-5 was actually billed. The rest was covered by a flat-rate plan. Without separating billing tiers, every cost report was fiction.
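The gap between dashboard cost and billed cost comes down to summing two kinds of usage separately. A minimal sketch, assuming each usage row records its metered list price and whether a flat-rate plan covered it; the `Usage` fields and `summarize` function are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Iterable


@dataclass
class Usage:
    # Hypothetical row shape for illustration.
    model: str
    list_price_usd: float  # what the call would cost at metered rates
    flat_rate: bool        # True when a flat-rate plan covers the call


def summarize(rows: Iterable[Usage]) -> Dict[str, float]:
    """Split notional cost into what is actually billed and what the plan absorbs."""
    rows = list(rows)
    billed = sum(r.list_price_usd for r in rows if not r.flat_rate)
    covered = sum(r.list_price_usd for r in rows if r.flat_rate)
    return {
        "billed_usd": round(billed, 2),
        "covered_by_plan_usd": round(covered, 2),
    }
```

Reporting both numbers side by side keeps the notional figure useful for capacity planning while making it hard to mistake for the invoice.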
Apr 5 · Harness · Built tab notifications for the terminal. When an agent finishes thinking, the tab lights up with a bell icon and a colored border. Small thing. Eliminated 30 minutes of daily context-switch overhead.
Apr 5 · Harness · Launched a family WhatsApp group with the chief-of-staff agent as coordinator. It recommends restaurants with walk-time estimates and handles scheduling. My wife uses it more than I do.
Apr 5 · Quality · One command shows full system health across 7 dimensions. Pipeline freshness, credential TTLs, agent payload staleness, delivery timeouts, contract test results. Replaced inspecting 17 scattered JSON files.
Apr 3 · Harness · Built a shared context layer so all 7 agents know about each other. Two tiers: universal awareness for everyone, business context only for agents that need it. No cross-domain bleed.
Apr 3 · Quality · Published the first engineering scorecard across 8 pillars. Architecture was already strong. Observability and engineering quality became the load-bearing improvement areas.
Apr 2 · Harness · Type "piper" in any terminal and a session opens with that agent's full compiled context. Identity, architecture, domain knowledge, live task state. Zero file reads. The CLI is the interface.
Apr 2 · Harness · Swapped 7 agent heartbeats from a 671B-parameter model to a 3B-parameter model. Same job, 85% cost reduction. Low-reasoning tasks run 100x more often than deep work, so model choice there dominates cost.
March 2026
Mar 30 · Quality · Built two-stage CI/CD. Local commit hook under 15 seconds, remote GitHub Actions for the full suite. Path filtering so a docs change doesn't trigger a build test.
Mar 28 · Quality · Took the codebase from D- to A-. 213 bare exception handlers eliminated. 876 print() calls converted to structured logging. 200 unit tests from zero. 14 shared libraries extracted from 5 monolithic scripts.
Mar 15 · Harness · Shipped the intent engine. 31 detectors scan Gmail, Calendar, iMessage, WhatsApp, and task state every 5 minutes. Each agent gets a ranked priority queue of what needs attention right now. The system's nose for what matters.
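One way to picture detectors feeding a ranked per-agent queue is sketched below. The two detectors, the `Signal` shape, the scores, and the state layout are invented for illustration; the log does not describe how the 31 real detectors score or route their findings.

```python
from typing import Callable, Dict, List, NamedTuple


class Signal(NamedTuple):
    score: float   # higher means more urgent (hypothetical scale)
    agent: str     # which agent should see this
    summary: str


Detector = Callable[[Dict], List[Signal]]


def overdue_tasks(state: Dict) -> List[Signal]:
    """Hypothetical detector: flag tasks marked overdue."""
    return [Signal(0.9, t["agent"], f"overdue: {t['title']}")
            for t in state.get("tasks", []) if t.get("overdue")]


def unanswered_messages(state: Dict) -> List[Signal]:
    """Hypothetical detector: flag messages with no reply yet."""
    return [Signal(0.6, m["agent"], f"reply to {m['from']}")
            for m in state.get("messages", []) if not m.get("answered")]


def build_queues(state: Dict, detectors: List[Detector]) -> Dict[str, List[Signal]]:
    """Run every detector, group signals by agent, and rank each group by score."""
    queues: Dict[str, List[Signal]] = {}
    for detect in detectors:
        for signal in detect(state):
            queues.setdefault(signal.agent, []).append(signal)
    for signals in queues.values():
        signals.sort(key=lambda s: s.score, reverse=True)
    return queues
```

Each detector stays a small pure function over a snapshot of state, so adding a new one does not touch the ranking step.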
Mar 1 · AI Infrastructure · One 400-line SOP.md gives Claude permission to deploy autonomously. Features go from code to production in a single conversation. No staging environment, no manual deploys.
February 2026
Feb 24 · Harness · Built Flare after a single session burned 430M tokens in 90 minutes from an auto-compact loop. Three-layer monitoring: a Claude Code hook tracks usage in real time, a daemon projects pace against the billing window, and the agent sees the projection at boot.
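The "projects pace against the billing window" step can be illustrated with a simple linear extrapolation. This is a sketch under the assumption that usage continues at its current average rate; the function names, the budget parameter, and the choice of linear projection are not taken from Flare itself.

```python
def projected_tokens(used_tokens: int, elapsed_s: float, window_s: float) -> float:
    """Extrapolate current usage to the end of the billing window."""
    if elapsed_s <= 0:
        raise ValueError("elapsed_s must be positive")
    return used_tokens * (window_s / elapsed_s)


def will_exceed(used_tokens: int, elapsed_s: float,
                window_s: float, budget_tokens: int) -> bool:
    """True when the current pace would overshoot the budget for the window."""
    return projected_tokens(used_tokens, elapsed_s, window_s) > budget_tokens
```

A linear projection overreacts to short bursts early in a window, so a real monitor might smooth the rate or wait for a minimum elapsed time before alerting.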
Feb 19 · Harness · Built the boot compiler that gives each agent source-backed operating context. 350 lines of Python, runs every 5 minutes, produces per-agent payloads up to 306KB. Seven agents boot with their domain state already assembled.
Feb 19 · Harness · Built Pulse, the unified inbox. iMessage, WhatsApp, Apple Notes, Slack, Gmail, Granola normalized into one JSONL stream. 37 iMessages and 219 Apple Notes flowed in on the first run. Every agent reads one file.
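Normalizing several sources into one JSONL stream means mapping each source's field names onto a shared record shape and writing one JSON object per line. The sketch below assumes a four-field record and a couple of plausible source field names; Pulse's actual schema is not described in this log.

```python
import json
from typing import Dict, Iterable


def normalize(source: str, raw: Dict) -> Dict:
    """Map a source-specific record onto one shared shape (hypothetical schema)."""
    return {
        "source": source,
        "sender": raw.get("from") or raw.get("author") or "",
        "text": raw.get("text") or raw.get("body") or "",
        "ts": raw.get("ts", ""),
    }


def to_jsonl(records: Iterable[Dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)
```

With a shared shape, every downstream reader parses the file line by line and never needs to know which app a message came from beyond the `source` field.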
January 2026
Jan 20 · Quality · Built Critique, a two-layer design quality gate. Code analysis checks source files for wrong patterns. Playwright captures screenshots in four variants and Claude vision judges the pixels. The design system document is the rule engine.