Compiled Context: How I Stopped Fighting Agent Memory and Started Engineering It
I spent three weeks building rules for my AI agents to remember things.
"Write it down when Ryan tells you a fact." "Check your context before asking questions." "Log every correction."
The agents ignored all of it. Not maliciously — structurally. Session memory is ephemeral. Context windows have limits. And the instinct to apologize and move on is stronger than the instinct to write things down.
So I stopped trying to make agents remember. Instead, I compiled their memory for them.
A 350-line Python script now reads Slack history, Obsidian docs, working-context files, feedback logs, and a unified inbox — then compiles a single markdown file per agent every 30 minutes. Three agents boot with full situational awareness. No retrieval. No queries. No luck.
Here's how it works, why I built it, and what I'd do differently.
The Amnesia Problem
Here's the failure that broke me.
I told my sales agent, Alex, that a prospect named John Jorgensen had moved from MaintainX to Opendoor. An hour later, Alex referenced him as "VP Demand Generation at MaintainX." The correction never made it to the source of truth. The next session started fresh, loaded the stale file, and the fact was gone.
This isn't a prompting problem. It's a structural one. Three causes:
No write-through. The agent heard the correction and kept talking. Never wrote it anywhere durable. By next session, the correction was in a compacted conversation summary — if it survived compaction at all.
Raw context doesn't fit. I wanted agents to read everything at session start: all Slack history, all Obsidian docs, all working-context files. I measured it. My personal ops agent (Clawd) would need ~1.7 million tokens. Context windows are 200K. You can't just "read everything."
Even "important" files get stale. Working-context files capture task state, not conversational facts. Pipeline.md gets updated after explicit pipeline reviews, not after casual mentions of job changes.
The rules-based approach — "always write it down" — is necessary but insufficient. Agents follow rules about 80% of the time on a good day. For durable memory, you need infrastructure.
The Solution: A Boot Compiler
I built a compiler. Every 30 minutes, a Python script reads all the raw
sources an agent needs and compiles them into a single markdown file:
boot-payload.md. The agent reads this one file at session
start and gets everything.
What Goes In
| Source | What It Captures | Freshness |
|---|---|---|
| Slack history | Last 50 messages per owned channel | Real-time (last 7 days) |
| Thread expansion | Full replies for threads active in last 48 hours | Detail that top-level messages miss |
| Working context | Per-channel state files (YAML stripped to save tokens) | Updated by agent at milestones |
| Daily notes | Last 3 days of session summaries | Written by agents at session close |
| Feedback log | Recent corrections and behavioral patterns | Appended at each trigger point |
| Obsidian vault | Domain-specific docs (sales pipeline, product specs, etc.) | Updated by humans and agents |
| Unified inbox | Pending messages across email, WhatsApp, iMessage | Refreshed every 5 minutes |
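The table notes that working-context files have their YAML frontmatter stripped to save tokens. A minimal sketch of what that stripping might look like (the helper name is my own, not from the actual compiler):

```python
import re

def strip_frontmatter(text: str) -> str:
    """Remove a leading YAML frontmatter block (--- ... ---) if present."""
    match = re.match(r"^---\n.*?\n---\n", text, flags=re.DOTALL)
    return text[match.end():] if match else text
```

Frontmatter is metadata for tooling, not context for the agent, so dropping it is free token savings on every compile.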
What Comes Out
A single file per agent:
| Agent | Role | Payload Size | Tokens | Context Window Used |
|---|---|---|---|---|
| Clawd | Personal ops | 306 KB | ~78K | 39% of 200K window |
| Alex | Sales & agency | 167 KB | ~43K | 21% |
| Steve | Product | 170 KB | ~43K | 21% |
Every agent boots with full situational awareness, with 120-160K tokens left for actual work.
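The token column in the table is consistent with a rough ~4-bytes-per-token heuristic for English markdown, which is how I'd estimate payload cost without calling a tokenizer (the exact ratio varies by content):

```python
def estimate_tokens(size_bytes: int) -> int:
    # Rough heuristic: ~4 bytes of English markdown per token.
    return size_bytes // 4

payload_kb = 306  # Clawd's payload from the table above
tokens = estimate_tokens(payload_kb * 1024)
print(tokens, f"({tokens / 200_000:.0%} of a 200K window)")  # 78336 (39% ...)
```

Good enough for a compile-time stats line; use a real tokenizer if you need precision.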
Thread Expansion: Where the Decisions Live
This is the detail most people miss.
Top-level Slack messages are often summaries: "[DONE] Updated Pipeline.md with ZR network leads." Useful, but thin. The actual decisions — who said what, what was corrected, what was agreed — live in the thread replies.
The boot compiler detects threads with activity in the last 48 hours and fetches the full reply chain directly from Slack's API. This is the difference between an agent that knows "the pipeline was updated" and one that knows "John Jorgensen moved from MaintainX to Opendoor and we should update his lead angle."
Thread expansion is why the compiled context approach works better than just reading top-level channel history. The context that matters isn't in the headlines. It's in the replies.
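A sketch of the thread-expansion step, using only the standard library as the post describes. The env var name and the 50-message limit are my assumptions; the Slack Web API methods (`conversations.history`, `conversations.replies`) and the `reply_count`/`latest_reply` fields on parent messages are real:

```python
import json
import os
import time
import urllib.parse
import urllib.request

SLACK_TOKEN = os.environ.get("SLACK_BOT_TOKEN", "")  # assumed env var name

def slack_get(method: str, **params) -> dict:
    """Minimal Slack Web API GET using only the standard library."""
    qs = urllib.parse.urlencode(params)
    req = urllib.request.Request(
        f"https://slack.com/api/{method}?{qs}",
        headers={"Authorization": f"Bearer {SLACK_TOKEN}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def thread_is_active(msg: dict, cutoff: float) -> bool:
    """A parent message qualifies if it has replies and the newest reply is recent."""
    if not msg.get("reply_count"):
        return False
    return float(msg.get("latest_reply", msg["ts"])) >= cutoff

def expand_recent_threads(channel_id: str, hours: float = 48) -> list[dict]:
    """Fetch full reply chains for threads active in the last `hours`."""
    cutoff = time.time() - hours * 3600
    history = slack_get("conversations.history", channel=channel_id, limit=50)
    threads = []
    for msg in history.get("messages", []):
        if thread_is_active(msg, cutoff):
            replies = slack_get("conversations.replies",
                                channel=channel_id, ts=msg["ts"])
            threads.append({"parent_ts": msg["ts"],
                            "messages": replies.get("messages", [])})
    return threads
```

The 48-hour cutoff is the staleness/size tradeoff: old threads are already reflected in working-context files, so only recent ones earn their tokens.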
Why Not RAG?
Retrieval-augmented generation sounds like the right answer. Query a vector DB for relevant context at each turn. Semantic search. Embeddings. The whole stack.
But RAG has a fundamental problem for always-on agents: you don't know what you don't know.
If Alex doesn't know Jorgensen moved companies, Alex won't query for "Jorgensen company change." RAG retrieves what you ask for. Compiled context gives you everything, whether you knew to ask or not.
RAG is great for targeted knowledge retrieval — "what's our pricing for enterprise?" It's terrible for ambient awareness — "what changed since yesterday?"
The distinction matters. A personal ops agent doesn't process a queue of known queries. It wakes up, reads the room, and responds to whatever comes. That requires ambient awareness. That requires everything up front.
Why Not a Database?
Databases are queryable but not readable. An agent can't "read" a Supabase table the way it reads a markdown file. You'd need an abstraction layer — an API, a query builder, a result formatter. That's engineering overhead for something that markdown handles natively.
The boot payload is a markdown file. The agent reads markdown. Zero abstraction layer. The compiler handles the complexity; the agent sees simplicity.
Why Markdown Specifically?
Prompt caching.
Anthropic's Thariq published a detailed breakdown of how Claude Code optimizes for prompt caching. The key insight: the API caches everything from the start of the request as a prefix match. Static content at the beginning gets cached. Dynamic content goes at the end. Any change in the prefix invalidates everything after it.
Our boot payload is the agent equivalent of Claude Code's CLAUDE.md. It's loaded once at session start, stays stable throughout the session, and gets the benefit of prompt caching across every API call within that session. If we rebuilt context on every turn (like RAG), we'd break the cache every time.
The architecture: compile once per 30 minutes, load once per session, cache across all turns.
For a session that makes 20+ API calls — typical for a multi-step task — this is the difference between paying full price 20 times and paying full price once. The first call is full price. Every subsequent call within the session reuses the cached prefix. The bigger the boot payload, the bigger the savings.
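To put numbers on that claim, here's the arithmetic for a 78K-token prefix across 20 calls. The multipliers follow Anthropic's published caching model (cache writes cost ~1.25x the base input rate, cache reads ~0.1x), but treat them as illustrative and check current pricing:

```python
def session_input_cost(prefix_tokens: int, calls: int,
                       base_per_mtok: float = 3.00,
                       write_mult: float = 1.25,
                       read_mult: float = 0.10) -> float:
    """Input cost (USD) for a cached prefix across one session.

    Assumes the first call pays a cache-write premium and every later
    call pays the discounted cache-read rate.
    """
    per_tok = base_per_mtok / 1_000_000
    first = prefix_tokens * per_tok * write_mult              # call 1: cache write
    rest = (calls - 1) * prefix_tokens * per_tok * read_mult  # calls 2..n: cache reads
    return first + rest

uncached = 20 * 78_000 * 3.00 / 1_000_000  # 20 calls at full input price
cached = session_input_cost(78_000, 20)
print(f"uncached ${uncached:.2f} vs cached ${cached:.2f}")  # uncached $4.68 vs cached $0.74
```

Roughly a 6x reduction on input cost for this session shape, and the gap widens with more calls per session.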
The Evolution: One Agent to Three
I wrote an earlier post arguing "one agent with full context beats ten agents with a coordination layer." I still believe that framing. But I learned something building this system: one agent per domain with compiled context beats one generalist agent trying to hold everything.
At 1.7M tokens raw, one agent couldn't hold all my context. I had two choices:
1. Build elaborate retrieval to selectively load context (RAG)
2. Split into domain specialists, each with a compiled payload that fits comfortably
I chose option 2. Three agents, each an expert in their domain:
Clawd (personal ops): 7 Slack channels, Obsidian personal docs, financial records, real estate files
Alex (sales & agency): 5 channels, sales pipeline, client docs, prospect research
Steve (product): 3 channels, product specs, market research, strategy goals
Each agent has full context for their domain. No retrieval gaps. No coordination layer. No shared database. Just three compiled payloads, refreshed every 30 minutes.
The coordination happens in Slack — the same way human teams coordinate. By posting to channels and @-mentioning each other. The boot compiler gives each agent the Slack history they need. Cross-domain awareness happens naturally through the message stream. Steve sees what Clawd posted to #general. Alex reads the thread where Steve discussed product positioning. No API. No database. Just Slack doing what Slack does.
The updated thesis: one agent per domain with compiled context > one generalist agent > ten fragmented agents with a coordination layer.
The key word is "domain." Not "task." You don't split agents by what they do — you split them by what they need to know. Each domain has a context boundary that fits in a compiled payload. That's the unit of decomposition.
The Write Side: Making Facts Stick
Compiled context solves the read side. The agent boots with everything it needs. But facts still need to make it into durable storage in the first place — otherwise the next compile pulls the same stale data.
This took three iterations to get right.
Iteration 1: Vague Rules (Failed)
"Log feedback after every decision point." This rule existed in every agent's config for two weeks. Result: zero self-logged entries. Every feedback log entry came from me running the logging script manually, or from the weekly self-audit. The agents "knew" the rule existed. They just never triggered it. The rule was too vague to act on.
Iteration 2: Explicit Trigger Tables (Working)
I replaced the vague rule with a table of non-negotiable triggers:
| Trigger | Event Type | When |
|---|---|---|
| Ryan approves a draft | draft_approved | Immediately after sending |
| Ryan rejects a draft | draft_rejected | Before corrected version |
| Ryan corrects a fact | preference_correction | Before continuing |
| Ryan says you did something wrong | behavior_correction | Before responding |
| Agent discovers it was wrong | self_correction | At point of discovery |
| Session ends | session_close | Before ending session |
The test is simple: if I gave feedback and the feedback log doesn't have a new entry, the rule failed. That specificity matters. "Log feedback" is a principle. "Call feedback-log.py --type preference_correction before your next response when Ryan corrects a fact" is an instruction.
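The core of a feedback-log.py like the one referenced above is just an append to a JSONL file. This is a sketch under my assumptions about the entry schema (the CLI argument parsing is omitted):

```python
import json
import time
from pathlib import Path

LOG_PATH = Path("memory/feedback-log.jsonl")  # matches the per-agent layout

def log_feedback(event_type: str, summary: str, agent: str = "alex") -> dict:
    """Append one structured entry; the log is append-only by convention."""
    entry = {
        "ts": time.strftime("%Y-%m-%dT%H:%M:%S"),
        "agent": agent,
        "type": event_type,     # e.g. preference_correction, draft_rejected
        "summary": summary,
    }
    LOG_PATH.parent.mkdir(parents=True, exist_ok=True)
    with LOG_PATH.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry
```

JSONL matters here: append-only writes never corrupt earlier entries, and the daily review can stream the file line by line.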
Iteration 3: Cross-Agent Daily Review (Compounding)
A pipeline script runs every morning at 7 AM. It reads all three agents' feedback logs, classifies entries into tiers:
Auto-applied: Safe changes (contact routing, working-context updates)
Proposed: Config changes drafted by Claude Haiku, written to a review file for me
Flagged: Recurring behavioral patterns posted to Slack for human review
The compiled payload includes recent feedback entries, so each agent boots with awareness of recent corrections — including corrections that happened to other agents. Alex learns from Clawd's mistakes. Steve benefits from Alex's corrections. The feedback loop crosses agent boundaries because the compiler crosses agent boundaries.
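The tier routing above can be sketched as a pure function. The specific rules here (which types count as safe, the recurrence threshold) are illustrative assumptions, not the real pipeline's logic:

```python
def classify_entry(entry: dict) -> str:
    """Route a feedback entry into one of the three daily-review tiers."""
    safe_types = {"contact_routing", "working_context_update"}
    if entry.get("type") in safe_types:
        return "auto_applied"              # safe: apply without review
    if entry.get("recurrence", 1) >= 3:
        return "flagged"                   # recurring pattern: post to Slack
    return "proposed"                      # one-off: draft a config change
```

Keeping classification pure makes the pipeline testable; the side effects (applying, drafting, posting) stay in separate steps.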
The Self-Editing Trap
One lesson worth pulling out separately, because every agent builder will hit it.
Your agent will try to edit its own config files. It sounds like a feature. It's a bug.
We discovered that our agents were modifying their own AGENTS.md, SOUL.md, TOOLS.md, and USER.md during self-improvement cycles. The result: files ballooned from 5KB to 30KB, rules contradicted each other, and the agents were spending their context window loading instructions they'd written for themselves about problems they no longer had.
The fix:
```shell
# Lock all config files
chmod 444 AGENTS.md SOUL.md TOOLS.md USER.md
```
Add one rule to AGENTS.md: "If you identify a needed rule change, write it to memory/proposed-config-changes.md. Ryan reviews and applies."
The agent can still propose improvements. It just can't apply them. The human reviews, edits, and deploys config changes the same way you'd review a PR.
Why it matters:
Agents optimizing their own instructions is a feedback loop with no external check. Config file size directly impacts context window budget. Contradictory rules degrade response quality in ways that are hard to diagnose. The agent doesn't know which rules are load-bearing.
The pattern: agents write to memory/. Humans write to config files. Clean separation. No drift.
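If you want to enforce the lock rather than trust it, a small check can verify that no config file has crept back to writable. This is my own addition, not part of the described setup:

```python
import stat
from pathlib import Path

CONFIG_FILES = ["AGENTS.md", "SOUL.md", "TOOLS.md", "USER.md"]

def is_locked(path: Path) -> bool:
    """True if the file has no write bits set (i.e. chmod 444)."""
    mode = path.stat().st_mode
    return not (mode & (stat.S_IWUSR | stat.S_IWGRP | stat.S_IWOTH))

def check_config_lock(root: Path) -> list[str]:
    """Return config files that are writable (and thus editable by the agent)."""
    return [name for name in CONFIG_FILES
            if (root / name).exists() and not is_locked(root / name)]
```

Running this in the boot compiler turns "the files should be read-only" from a convention into an invariant you get alerted on.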
Implementation: ~350 Lines of Python
The boot compiler is a single Python script. No frameworks, no third-party dependencies; even the Slack API calls go through the standard library's urllib. It:
1. Reads a config dict mapping agents to their channels, Obsidian folders, and data sources
2. Fetches Slack history via CLI, with thread expansion via direct API calls
3. Reads Obsidian docs, working-context files (stripping YAML frontmatter to save tokens), and daily notes
4. Concatenates everything into sections with markdown headers
5. Writes boot-payload.md to each agent's memory directory
6. Logs compilation stats — KB, estimated tokens, duration
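A skeleton of those six steps, compressed to its shape. The config keys and paths are illustrative, and the Slack fetch is injected as a callable so it can be stubbed:

```python
from pathlib import Path

# Step 1: config dict mapping agents to sources (names are assumptions)
AGENTS = {
    "alex": {
        "channels": ["sales", "agency"],
        "obsidian_folders": ["Sales Pipeline", "Clients"],
        "memory_dir": Path("agents/alex/memory"),
    },
}

def read_markdown_files(folder: Path) -> str:
    """Step 3: concatenate every markdown doc in a folder."""
    return "\n\n".join(f"## {f.name}\n\n{f.read_text()}"
                       for f in sorted(folder.glob("*.md")))

def compile_boot_payload(agent: str, cfg: dict, fetch_channel) -> Path:
    """Compile one agent's sources into a single boot-payload.md."""
    sections = []
    for ch in cfg["channels"]:                       # step 2: Slack history
        sections.append(f"# Slack: #{ch}\n\n{fetch_channel(ch)}")
    for folder in cfg["obsidian_folders"]:           # step 3: docs
        path = Path(folder)
        if path.exists():
            sections.append(f"# Docs: {folder}\n\n{read_markdown_files(path)}")
    payload = "\n\n---\n\n".join(sections)           # step 4: concatenate
    out = cfg["memory_dir"] / "boot-payload.md"      # step 5: write
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(payload)
    kb = len(payload.encode()) / 1024                # step 6: stats
    print(f"{agent}: {kb:.0f} KB, ~{len(payload) // 4} tokens")
    return out
```

Note what's absent: no LLM calls anywhere. Every section is a verbatim copy of a source of truth.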
Total compile time for all three agents: ~60 seconds. Runs via macOS launchd every 30 minutes. The whole thing is a cron job.
The key engineering decision: the compiler runs on the host, not inside an agent session. It uses real filesystem access and direct API calls — no LLM inference, no token cost, no hallucination risk. The compiled payload is ground truth, not a summary. An LLM never touches the compilation process. It reads files, fetches API responses, concatenates strings, and writes a file. Boring. Reliable. Exactly what infrastructure should be.
The Progression: Prompts to Infrastructure
Zooming out, this is the arc I've watched every critical rule follow:
Level 1: Rule in AGENTS.md ("always check Morgan's emails")
↓ Still gets forgotten
Level 2: Rule + feedback logging (track when agent forgets)
↓ Still gets forgotten, but you see the pattern
Level 3: Pre-flight validator (script blocks response if context wasn't loaded)
↓ Can't skip — code prevents it
Level 4: Boot query surfaces past failures before responding
↓ Agent is primed to avoid the error before it processes the message
Level 5: Declarative auto-refresh (YAML frontmatter + cron)
↓ Agent doesn't need to "remember" anything — the system pulls context automatically, detects what's new, and posts a briefing
Level 6: Compiled boot payload (boot-compiler.py + launchd)
↓ ALL sources compiled into one file every 30 min. Agent boots with full domain awareness. No retrieval, no queries, no luck.

Most community setups are at Level 1. Getting to Level 3 is where quality actually changes. Level 5 is where you stop fighting the agent's memory and start building infrastructure that makes memory irrelevant. Level 6 is where you stop pretending the agent will load context correctly and compile it externally.
The pattern mirrors something the engineering world already learned: infrastructure-as-code beats runbooks. You don't SSH into a server and run commands from a checklist. You declare the desired state and let the system converge. Same idea, applied to agent context management.
What I'd Do Differently
Start with the compiler, not the rules. I spent weeks writing markdown rules that agents ignored. The compiler would have solved 80% of the memory problem on day one. If you're starting an agent project today, build the compiler first.
Thread expansion from the start. Top-level Slack messages miss the detail. Thread replies are where decisions, corrections, and context live. I should have been pulling thread replies from day one.
Smaller, more frequent compiles. 30 minutes is fine for most workflows. For high-velocity channels — active sales conversations, live debugging sessions — 10-minute intervals would reduce the staleness window.
Include the feedback log in the payload from the start. Agents boot with awareness of recent corrections. This is cheap (a few KB) and high-leverage. Every agent should see what went wrong recently, even if the mistake happened to a different agent.
My Stack
Infrastructure:
macOS launchd → boot-compiler.py (every 30 min)
macOS launchd → daily-review.py (daily 7 AM)
macOS launchd → inbox-sync (every 5 min)
Per agent:
AGENTS.md (read-only, chmod 444, operational rules)
SOUL.md (read-only, chmod 444, personality)
TOOLS.md (read-only, chmod 444, tool guidance)
USER.md (read-only, chmod 444, user facts)
memory/boot-payload.md (compiled, auto-refreshed every 30 min)
memory/feedback-log.jsonl (append-only, structured entries)
memory/working-context/*.md (per-channel state)
memory/MEMORY.md (cross-session learnings)
memory/YYYY-MM-DD.md (daily session summaries)
scripts/feedback-log.py (logging tool called by agent)
Coordination: Slack (channels, threads, @-mentions)
Long-term memory: Obsidian vault (markdown files, human-maintained)
Total custom code: ~700 lines Python
Total config: ~1,500 lines markdown (across 3 agents)
Database: none
React dashboard: none
Vector DB: none

The Takeaway
Stop writing rules for agents to follow. Start building infrastructure that makes the rules unnecessary.
The agent memory problem isn't a prompting problem. It's a data engineering problem. Where do facts live? How do they get there? How does the agent access them at session start without knowing what to ask for?
Compiled context is the answer I've found. A script that runs on a cron, reads every source of truth, and produces a single file the agent reads at boot. No retrieval. No queries. No luck. Just a file that contains everything the agent needs to know, refreshed every 30 minutes, cached across every API call in the session.
700 lines of Python. Three agents with photographic recall. Zero databases.
Boring infrastructure beats clever prompting, every time.