Compiled Context: How I Stopped Fighting Agent Memory and Started Engineering It
I spent three weeks building rules for my AI agents to remember things.
"Write it down when Ryan tells you a fact." "Check your context before asking questions." "Log every correction."
The agents ignored all of it. Not maliciously — structurally. Session memory is ephemeral. Context windows have limits. And the instinct to apologize and move on is stronger than the instinct to write things down.
So I stopped trying to make agents remember. Instead, I compiled their memory for them.
A 350-line Python script now reads Slack history, Obsidian docs, working-context files, feedback logs, and a unified inbox — then compiles a single markdown file per agent every 30 minutes. Three agents boot with full situational awareness. No retrieval. No queries. No luck.
Here's how it works, why I built it, and what I'd do differently.
The Amnesia Problem
Here's the failure that broke me.
I told my sales agent, Alex, that a prospect named John Jorgensen had moved from MaintainX to Opendoor. An hour later, Alex referenced him as "VP Demand Generation at MaintainX." The correction never made it to the source of truth. The next session started fresh, loaded the stale file, and the fact was gone.
This isn't a prompting problem. It's a structural one. Three causes:
No write-through. The agent heard the correction and kept talking. Never wrote it anywhere durable. By next session, the correction was in a compacted conversation summary — if it survived compaction at all.
Raw context doesn't fit. I wanted agents to read everything at session start: all Slack history, all Obsidian docs, all working-context files. I measured it. My personal ops agent (Clawd) would need ~1.7 million tokens. Context windows are 200K. You can't just "read everything."
Even "important" files get stale. Working-context files capture task state, not conversational facts. Pipeline.md gets updated after explicit pipeline reviews, not after casual mentions of job changes.
The rules-based approach — "always write it down" — is necessary but insufficient. Agents follow rules about 80% of the time on a good day. For durable memory, you need infrastructure.
The Solution: A Boot Compiler
I built a compiler. Every 30 minutes, a Python script reads all the raw
sources an agent needs and compiles them into a single markdown file:
boot-payload.md. The agent reads this one file at session
start and gets everything.
What Goes In
| Source | What It Captures | Freshness |
|---|---|---|
| Slack history | Last 50 messages per owned channel | Real-time (last 7 days) |
| Thread expansion | Full replies for threads active in last 48 hours | Detail that top-level messages miss |
| Working context | Per-channel state files (YAML stripped to save tokens) | Updated by agent at milestones |
| Daily notes | Last 3 days of session summaries | Written by agents at session close |
| Feedback log | Recent corrections and behavioral patterns | Appended at each trigger point |
| Obsidian vault | Domain-specific docs (sales pipeline, product specs, etc.) | Updated by humans and agents |
| Unified inbox | Pending messages across email, WhatsApp, iMessage | Refreshed every 5 minutes |
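The table notes that working-context files have their YAML frontmatter stripped to save tokens. A minimal sketch of what that stripping might look like (the helper name is my own, not from the actual compiler):

```python
import re

def strip_frontmatter(text: str) -> str:
    """Remove a leading YAML frontmatter block (--- ... ---) if present."""
    match = re.match(r"^---\n.*?\n---\n", text, flags=re.DOTALL)
    return text[match.end():] if match else text
```

Frontmatter is metadata for tooling, not context for the agent, so dropping it is free token savings on every compile.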
What Comes Out
A single file per agent:
| Agent | Role | Payload Size | Tokens | Context Window Used |
|---|---|---|---|---|
| Clawd | Personal ops | 306 KB | ~78K | 39% of 200K window |
| Alex | Sales & agency | 167 KB | ~43K | 21% |
| Steve | Product | 170 KB | ~43K | 21% |
Every agent boots with full situational awareness, with 120-160K tokens left for actual work.
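The token column in the table is consistent with a rough ~4-bytes-per-token heuristic for English markdown, which is how I'd estimate payload cost without calling a tokenizer (the exact ratio varies by content):

```python
def estimate_tokens(size_bytes: int) -> int:
    # Rough heuristic: ~4 bytes of English markdown per token.
    return size_bytes // 4

payload_kb = 306  # Clawd's payload from the table above
tokens = estimate_tokens(payload_kb * 1024)
print(tokens, f"({tokens / 200_000:.0%} of a 200K window)")  # 78336 (39% ...)
```

Good enough for a compile-time stats line; use a real tokenizer if you need precision.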
Thread Expansion: Where the Decisions Live
This is the detail most people miss.
Top-level Slack messages are often summaries: "[DONE] Updated Pipeline.md with ZR network leads." Useful, but thin. The actual decisions — who said what, what was corrected, what was agreed — live in the thread replies.
The boot compiler detects threads with activity in the last 48 hours and fetches the full reply chain directly from Slack's API. This is the difference between an agent that knows "the pipeline was updated" and one that knows "John Jorgensen moved from MaintainX to Opendoor and we should update his lead angle."
Thread expansion is why the compiled context approach works better than just reading top-level channel history. The context that matters isn't in the headlines. It's in the replies.
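A sketch of the thread-expansion step, using only the standard library as the post describes. The env var name and the 50-message limit are my assumptions; the Slack Web API methods (`conversations.history`, `conversations.replies`) and the `reply_count`/`latest_reply` fields on parent messages are real:

```python
import json
import os
import time
import urllib.parse
import urllib.request

SLACK_TOKEN = os.environ.get("SLACK_BOT_TOKEN", "")  # assumed env var name

def slack_get(method: str, **params) -> dict:
    """Minimal Slack Web API GET using only the standard library."""
    qs = urllib.parse.urlencode(params)
    req = urllib.request.Request(
        f"https://slack.com/api/{method}?{qs}",
        headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def thread_is_active(msg: dict, cutoff: float) -> bool:
    """A parent message qualifies if it has replies and the newest reply is recent."""
    if not msg.get("reply_count"):
        return False
    return float(msg.get("latest_reply", msg["ts"])) >= cutoff

def expand_recent_threads(channel_id: str, hours: float = 48) -> list[dict]:
    """Fetch full reply chains for threads active in the last `hours`."""
    cutoff = time.time() - hours * 3600
    history = slack_get("conversations.history", channel=channel_id, limit=50)
    threads = []
    for msg in history.get("messages", []):
        if thread_is_active(msg, cutoff):
            replies = slack_get("conversations.replies",
                                channel=channel_id, ts=msg["ts"])
            threads.append({"parent_ts": msg["ts"],
                            "messages": replies.get("messages", [])})
    return threads
```

The 48-hour cutoff is the staleness/size tradeoff: old threads are already reflected in working-context files, so only recent ones earn their tokens.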
Why Not RAG?
Retrieval-augmented generation sounds like the right answer. Query a vector DB for relevant context at each turn. Semantic search. Embeddings. The whole stack.
But RAG has a fundamental problem for always-on agents: you don't know what you don't know.
If Alex doesn't know Jorgensen moved companies, Alex won't query for "Jorgensen company change." RAG retrieves what you ask for. Compiled context gives you everything, whether you knew to ask or not.
RAG is great for targeted knowledge retrieval — "what's our pricing for enterprise?" It's terrible for ambient awareness — "what changed since yesterday?"
The distinction matters. A personal ops agent doesn't process a queue of known queries. It wakes up, reads the room, and responds to whatever comes. That requires ambient awareness. That requires everything up front.
Why Not a Database?
Databases are queryable but not readable. An agent can't "read" a Supabase table the way it reads a markdown file. You'd need an abstraction layer — an API, a query builder, a result formatter. That's engineering overhead for something that markdown handles natively.
The boot payload is a markdown file. The agent reads markdown. Zero abstraction layer. The compiler handles the complexity; the agent sees simplicity.
Why Markdown Specifically?
Prompt caching.
Anthropic's Thariq published a detailed breakdown of how Claude Code optimizes for prompt caching. The key insight: the API caches everything from the start of the request as a prefix match. Static content at the beginning gets cached. Dynamic content goes at the end. Any change in the prefix invalidates everything after it.
Our boot payload is the agent equivalent of Claude Code's CLAUDE.md. It's loaded once at session start, stays stable throughout the session, and gets the benefit of prompt caching across every API call within that session. If we rebuilt context on every turn (like RAG), we'd break the cache every time.
The architecture: compile once per 30 minutes, load once per session, cache across all turns.
For a session that makes 20+ API calls — typical for a multi-step task — this is the difference between paying full price 20 times and paying full price once. The first call is full price. Every subsequent call within the session reuses the cached prefix. The bigger the boot payload, the bigger the savings.
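To put numbers on that claim, here's the arithmetic for a 78K-token prefix across 20 calls. The multipliers follow Anthropic's published caching model (cache writes cost ~1.25x the base input rate, cache reads ~0.1x), but treat them as illustrative and check current pricing:

```python
def session_input_cost(prefix_tokens: int, calls: int,
                       base_per_mtok: float = 3.00,
                       write_mult: float = 1.25,
                       read_mult: float = 0.10) -> float:
    """Input cost (USD) for a cached prefix across one session.

    Assumes the first call pays a cache-write premium and every later
    call pays the discounted cache-read rate.
    """
    per_tok = base_per_mtok / 1_000_000
    first = prefix_tokens * per_tok * write_mult              # call 1: cache write
    rest = (calls - 1) * prefix_tokens * per_tok * read_mult  # calls 2..n: cache reads
    return first + rest

uncached = 20 * 78_000 * 3.00 / 1_000_000  # 20 calls at full input price
cached = session_input_cost(78_000, 20)
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")  # uncached $4.68 vs cached $0.74
```

Roughly a 6x reduction on input cost for this session shape, and the gap widens with more calls per session.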
The Evolution: One Agent to Three
I wrote an earlier post arguing "one agent with full context beats ten agents with a coordination layer." I still believe that framing. But I learned something building this system: one agent per domain with compiled context beats one generalist agent trying to hold everything.
At 1.7M tokens raw, one agent couldn't hold all my context. I had two choices:
1. Build elaborate retrieval to selectively load context (RAG)
2. Split into domain specialists, each with a compiled payload that fits comfortably
I chose option 2. Three agents, each an expert in their domain:
Clawd (personal ops): 7 Slack channels, Obsidian personal docs, financial records, real estate files
Alex (sales & agency): 5 channels, sales pipeline, client docs, prospect research
Steve (product): 3 channels, product specs, market research, strategy goals
Each agent has full context for their domain. No retrieval gaps. No coordination layer. No shared database. Just three compiled payloads, refreshed every 30 minutes.
The coordination happens in Slack — the same way human teams coordinate. By posting to channels and @-mentioning each other. The boot compiler gives each agent the Slack history they need. Cross-domain awareness happens naturally through the message stream. Steve sees what Clawd posted to #general. Alex reads the thread where Steve discussed product positioning. No API. No database. Just Slack doing what Slack does.
The updated thesis: one agent per domain with compiled context > one generalist agent > ten fragmented agents with a coordination layer.
The key word is "domain." Not "task." You don't split agents by what they do — you split them by what they need to know. Each domain has a context boundary that fits in a compiled payload. That's the unit of decomposition.
The Write Side: Making Facts Stick
Compiled context solves the read side. The agent boots with everything it needs. But facts still need to make it into durable storage in the first place — otherwise the next compile pulls the same stale data.
This took three iterations to get right.
Iteration 1: Vague Rules (Failed)
"Log feedback after every decision point." This rule existed in every agent's config for two weeks. Result: zero self-logged entries. Every feedback log entry came from me running the logging script manually, or from the weekly self-audit. The agents "knew" the rule existed. They just never triggered it. The rule was too vague to act on.
Iteration 2: Explicit Trigger Tables (Working)
I replaced the vague rule with a table of non-negotiable triggers:
| Trigger | Event Type | When |
|---|---|---|
| Ryan approves a draft | draft_approved | Immediately after sending |
| Ryan rejects a draft | draft_rejected | Before corrected version |
| Ryan corrects a fact | preference_correction | Before continuing |
| Ryan says you did something wrong | behavior_correction | Before responding |
| Agent discovers it was wrong | self_correction | At point of discovery |
| Session ends | session_close | Before ending session |
The test is simple: if I gave feedback and the feedback log doesn't have a new entry, the rule failed. That specificity matters. "Log feedback" is a principle. "Call feedback-log.py --type preference_correction before your next response when Ryan corrects a fact" is an instruction.
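The core of a feedback-log.py like the one referenced above is just an append to a JSONL file. This is a sketch under my assumptions about the entry schema (the CLI argument parsing is omitted):

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("memory/feedback-log.jsonl")  # matches the per-agent layout

def log_feedback(event_type: str, summary: str, agent: str = "alex") -> dict:
    """Append one structured entry; the log is append-only by convention."""
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "agent": agent,
        "type": event_type,     # e.g. preference_correction, draft_rejected
        "summary": summary,
    }
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

JSONL matters here: append-only writes never corrupt earlier entries, and the daily review can stream the file line by line.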
Iteration 3: Cross-Agent Daily Review (Compounding)
A pipeline script runs every morning at 7 AM. It reads all three agents' feedback logs, classifies entries into tiers:
Auto-applied: Safe changes (contact routing, working-context updates)
Proposed: Config changes drafted by Claude Haiku, written to a review file for me
Flagged: Recurring behavioral patterns posted to Slack for human review
The compiled payload includes recent feedback entries, so each agent boots with awareness of recent corrections — including corrections that happened to other agents. Alex learns from Clawd's mistakes. Steve benefits from Alex's corrections. The feedback loop crosses agent boundaries because the compiler crosses agent boundaries.
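The tier routing above can be sketched as a pure function. The specific rules here (which types count as safe, the recurrence threshold) are illustrative assumptions, not the real pipeline's logic:

```python
def classify_entry(entry: dict) -> str:
    """Route a feedback entry into one of the three daily-review tiers."""
    safe_types = {"contact_routing", "working_context_update"}
    if entry.get("type") in safe_types:
        return "auto_applied"              # safe: apply without review
    if entry.get("recurrence", 1) >= 3:
        return "flagged"                   # recurring pattern: post to Slack
    return "proposed"                      # one-off: draft a config change
```

Keeping classification pure makes the pipeline testable; the side effects (applying, drafting, posting) stay in separate steps.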
The Self-Editing Trap
One lesson worth pulling out separately, because every agent builder will hit it.
Your agent will try to edit its own config files. It sounds like a feature. It's a bug.
We discovered that our agents were modifying their own AGENTS.md, SOUL.md, TOOLS.md, and USER.md during self-improvement cycles. The result: files ballooned from 5KB to 30KB, rules contradicted each other, and the agents were spending their context window loading instructions they'd written for themselves about problems they no longer had.
The fix:
```shell
# Lock all config files
chmod 444 AGENTS.md SOUL.md TOOLS.md USER.md
```
Add one rule to AGENTS.md: "If you identify a needed rule change, write it to memory/proposed-config-changes.md. Ryan reviews and applies."
The agent can still propose improvements. It just can't apply them. The human reviews, edits, and deploys config changes the same way you'd review a PR.
Why it matters:
Agents optimizing their own instructions is a feedback loop with no external check. Config file size directly impacts context window budget. Contradictory rules degrade response quality in ways that are hard to diagnose. The agent doesn't know which rules are load-bearing.
The pattern: agents write to memory/. Humans write to config files. Clean separation. No drift.
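If you want to enforce the lock rather than trust it, a small check can verify that no config file has crept back to writable. This is my own addition, not part of the described setup:

```python
import stat
from pathlib import Path

CONFIG_FILES = ["AGENTS.md", "SOUL.md", "TOOLS.md", "USER.md"]

def is_locked(path: Path) -> bool:
    """True if the file has no write bits set (i.e. chmod 444)."""
    mode = path.stat().st_mode
    return not (mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

def check_config_lock(root: Path) -> list[str]:
    """Return config files that are writable (and thus editable by the agent)."""
    return [name for name in CONFIG_FILES
            if (root / name).exists() and not is_locked(root / name)]
```

Running this in the boot compiler turns "the files should be read-only" from a convention into an invariant you get alerted on.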
Implementation: ~350 Lines of Python
The boot compiler is a single Python script. No frameworks, no third-party dependencies; even the Slack API calls go through the standard library's urllib. It:
1. Reads a config dict mapping agents to their channels, Obsidian folders, and data sources
2. Fetches Slack history via CLI, with thread expansion via direct API calls
3. Reads Obsidian docs, working-context files (stripping YAML frontmatter to save tokens), and daily notes
4. Concatenates everything into sections with markdown headers
5. Writes boot-payload.md to each agent's memory directory
6. Logs compilation stats — KB, estimated tokens, duration
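A skeleton of those six steps, compressed to its shape. The config keys and paths are illustrative, and the Slack fetch is injected as a callable so it can be stubbed:

```python
from pathlib import Path

# Step 1: config dict mapping agents to sources (names are assumptions)
AGENTS = {
    "alex": {
        "channels": ["sales", "agency"],
        "obsidian_folders": ["Sales Pipeline", "Clients"],
        "memory_dir": Path("agents/alex/memory"),
    },
}

def read_markdown_files(folder: Path) -> str:
    """Step 3: concatenate every markdown doc in a folder."""
    return "\n\n".join(f"## {f.name}\n\n{f.read_text()}"
                       for f in sorted(folder.glob("*.md")))

def compile_boot_payload(agent: str, cfg: dict, fetch_channel) -> Path:
    """Compile one agent's sources into a single boot-payload.md."""
    sections = []
    for ch in cfg["channels"]:                       # step 2: Slack history
        sections.append(f"# Slack: #{ch}\n\n{fetch_channel(ch)}")
    for folder in cfg["obsidian_folders"]:           # step 3: docs
        path = Path(folder)
        if path.exists():
            sections.append(f"# Docs: {folder}\n\n{read_markdown_files(path)}")
    payload = "\n\n---\n\n".join(sections)           # step 4: concatenate
    out = cfg["memory_dir"] / "boot-payload.md"      # step 5: write
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(payload)
    kb = len(payload.encode()) / 1024                # step 6: stats
    print(f"{agent}: {kb:.0f} KB, ~{len(payload) // 4} tokens")
    return out
```

Note what's absent: no LLM calls anywhere. Every section is a verbatim copy of a source of truth.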
Total compile time for all three agents: ~60 seconds. Runs via macOS launchd every 30 minutes. The whole thing is a cron job.
The key engineering decision: the compiler runs on the host, not inside an agent session. It uses real filesystem access and direct API calls — no LLM inference, no token cost, no hallucination risk. The compiled payload is ground truth, not a summary. An LLM never touches the compilation process. It reads files, fetches API responses, concatenates strings, and writes a file. Boring. Reliable. Exactly what infrastructure should be.
The Progression: Prompts to Infrastructure
Zooming out, this is the arc I've watched every critical rule follow:
Level 1: Rule in AGENTS.md ("always check Morgan's emails")
↓ Still gets forgotten
Level 2: Rule + feedback logging (track when agent forgets)
↓ Still gets forgotten, but you see the pattern
Level 3: Pre-flight validator (script blocks response if context wasn't loaded)
↓ Can't skip — code prevents it
Level 4: Boot query surfaces past failures before responding
↓ Agent is primed to avoid the error before it processes the message
Level 5: Declarative auto-refresh (YAML frontmatter + cron)
↓ Agent doesn't need to "remember" anything — the system pulls context automatically, detects what's new, and posts a briefing
Level 6: Compiled boot payload (boot-compiler.py + launchd)
↓ ALL sources compiled into one file every 30 min. Agent boots with full domain awareness. No retrieval, no queries, no luck.

Most community setups are at Level 1. Getting to Level 3 is where quality actually changes. Level 5 is where you stop fighting the agent's memory and start building infrastructure that makes memory irrelevant. Level 6 is where you stop pretending the agent will load context correctly and compile it externally.
The pattern mirrors something the engineering world already learned: infrastructure-as-code beats runbooks. You don't SSH into a server and run commands from a checklist. You declare the desired state and let the system converge. Same idea, applied to agent context management.
What I'd Do Differently
Start with the compiler, not the rules. I spent weeks writing markdown rules that agents ignored. The compiler would have solved 80% of the memory problem on day one. If you're starting an agent project today, build the compiler first.
Thread expansion from the start. Top-level Slack messages miss the detail. Thread replies are where decisions, corrections, and context live. I should have been pulling thread replies from day one.
Smaller, more frequent compiles. 30 minutes is fine for most workflows. For high-velocity channels — active sales conversations, live debugging sessions — 10-minute intervals would reduce the staleness window.
Include the feedback log in the payload from the start. Agents boot with awareness of recent corrections. This is cheap (a few KB) and high-leverage. Every agent should see what went wrong recently, even if the mistake happened to a different agent.
My Stack
Infrastructure:
macOS launchd → boot-compiler.py (every 30 min)
macOS launchd → daily-review.py (daily 7 AM)
macOS launchd → inbox-sync (every 5 min)
Per agent:
AGENTS.md (read-only, chmod 444, operational rules)
SOUL.md (read-only, chmod 444, personality)
TOOLS.md (read-only, chmod 444, tool guidance)
USER.md (read-only, chmod 444, user facts)
memory/boot-payload.md (compiled, auto-refreshed every 30 min)
memory/feedback-log.jsonl (append-only, structured entries)
memory/working-context/*.md (per-channel state)
memory/MEMORY.md (cross-session learnings)
memory/YYYY-MM-DD.md (daily session summaries)
scripts/feedback-log.py (logging tool called by agent)
Coordination: Slack (channels, threads, @-mentions)
Long-term memory: Obsidian vault (markdown files, human-maintained)
Total custom code: ~700 lines Python
Total config: ~1,500 lines markdown (across 3 agents)
Database: none
React dashboard: none
Vector DB: none

The Takeaway
Stop writing rules for agents to follow. Start building infrastructure that makes the rules unnecessary.
The agent memory problem isn't a prompting problem. It's a data engineering problem. Where do facts live? How do they get there? How does the agent access them at session start without knowing what to ask for?
Compiled context is the answer I've found. A script that runs on a cron, reads every source of truth, and produces a single file the agent reads at boot. No retrieval. No queries. No luck. Just a file that contains everything the agent needs to know, refreshed every 30 minutes, cached across every API call in the session.
700 lines of Python. Three agents with photographic recall. Zero databases.
Boring infrastructure beats clever prompting, every time.