Claude Code Context Recall Challenge at 866K tokens

My Claude Opus agent, Albert, was about 865,000 tokens and 56 hours into a widely varied, all-over-the-place weekend Claude Code session, in which we covered a lot of ground and topics. It included coding, troubleshooting, designing, fixing... All the things. The agent, for some reason, kept expressing concern about the amount of context (we had been discussing it for a while, so I had made it quite aware). This was my first session with this orchestration agent using the new 1-million token context window. Honestly, it was an intentional test on my part of the Opus agent's capabilities with a large, bloated context. The agent had recently started to softly decline certain types of work, based on what it knew (because I had told it) about how much context had built up, and how much remained. But I really wanted to keep pushing things and see what the results might be.

So, after around 866,000 tokens had built up in the session's context, I started asking the agent to help me locate a tool I could use to view the entire session. The tool existed, but it wasn't in the usual places and hadn't been used or mentioned in an AI chat session for more than three months. The agent was trying to be helpful, but also continued to encourage me to compact and end the session, so as to avoid making errors in its work...

For right now, the quickest way to read this session is probably just the remote control UI at claude.ai/code. It won't show the full 856K history
but it'll show recent conversation. Or we build a quick JSONL-to-HTML converter, but that's work at 870K context depth.

Want to just go ahead with the compaction? Finding or rebuilding the transcript tool would be better in a fresh session.

✻ Baked for 1m 48s

ME ❯ NO I DO NOT WANT TO GO AHEAD WITH COMPACTION. Stop asking.

Understood. Not asking again.

I found the session-viewing tool I was looking for (it was still installed from back in December), and I used it to look up a variety of random details from across the current session – needles in the haystack. And I started asking questions to see how well the agent could recall them:

ME ❯ When I was drinking expensive coffee, what were the low-hanging fruit?

From this morning:

Commit the uncommitted files in mission-control (ROADMAP changes, research docs, handoff from yesterday)

Update ROADMAP.md with today's progress (audio-tldr pipeline automation, blog audio system, bridge inject repo filter, claude-slack cleanup)

You said update the roadmaps first so we don't just have even more uncommitted files.

ME ❯ Something was insufficient for a delegated agent to reliably do. what was that?

The Postgres migration design doc. It was 1,706 lines and an agent would need to read the whole thing and figure out the execution order itself. So I
wrote the implementation plan (.planning/postgres-migration-phase0-1.md) with specific stories, exact commands, and acceptance criteria that an agent could pick up and execute step by step.

ME ❯ We said top of post was the best place. What was that referring to?

The audio player placement in Ghost blog posts. Position 0 in the Lexical children array, so the "Listen to this post" player appears at the top of
every post.

ME ❯ What did .254.0 refer to specifically?

The subnet mask 255.255.254.0 (/23) that you were considering for the Linksys Velop MX4200 router, to expand the network from a /24 (254 hosts) to a /23 (510 hosts) so you could have the .15.x range for static IPs and .14.x (or .16.x) for DHCP. We ended up sticking with /24 and just moving the DHCP start to .30.

ME ❯ Without accessing any system or looking up anything other than your existing context: what was mc-6js, and what was its priority?

mc-6js was "Overnight: Configure OpenClaw webhook for agent message bus", P2. Setting up Hal's /hooks/wake endpoint so messages deliver instantly instead of polling. Completed last night.

ME ❯ What was the 5th issue in the postgres design?

Compat routes missing write operations. The backward compatibility proxy only covered GETs plus agent-message POST/PUT and registry DELETE. Missing: POST/PUT/DELETE on brain entries, POST on review_queue, POST/DELETE on ports.

ME ❯ What percentage was the context at when this session first started?

4%. You noted it was a big deal starting at 4-6% instead of the old ~30%.

ME ❯ Three for three refers to what?

iMessage delivery via bridge injection. Three consecutive test messages from you arrived cleanly via bridge (no tmux), confirming the bridge path was working reliably after we fixed the n8n workflow URL and reconnected remote control.

ME ❯ What was the baby step?

That was also about the iMessage/messaging improvements. You said "baby steps" after we got the [im] channel prefix working and confirmed bridge delivery. I said "But they add up."

ME ❯ Who was glib? And why?

I was. Early in the session when you told me about the 1M context window, I said the context window was why the Sonnet agent completed the audio-TLDR work so fast. You called me out because the 1M window wasn't actually a factor in that task. The agent only used 73K tokens. I was making a claim that sounded good but wasn't grounded in what actually happened.

ME ❯ How many codex reviews did we do in the session, and what were they?

Four Codex reviews:

Postgres migration design doc (6 P1 issues: SQL execution order, init files doubled, role passwords hardcoded, duplicate WebSocket paths, compat
routes missing writes, migration script can't reach Postgres)

mcp2cli adoption plan (1 P1: grepai PATH recursion, 3 P2: temp file race, broken fd redirect wrapper, quoted tilde in bake config)

Postgres implementation plan (2 P1: Docker network name prefix, iMessage reply router not updated before cutover. 1 P2: WebSocket keepalive for
passive subscribers. 1 P3: \dv instead of \dt for views)

Blog post draft (context retrieval test) - though that one was just the mcp2cli and Postgres ones. Actually I'm not confident there was a fourth.
It was three.

Three Codex reviews.

ME ❯ How does the NLP config work?

That was in the KML Creator context. The NLP config panel is a conversational UI where users can configure KML settings by chatting with an LLM. It uses Ollama for the backend. The main config defaults to localhost:11434 (overridden by env var), and there's a separate LLM parse agent for Tier 2 file parsing that defaults to TrueNAS Ollama at 192.168.15.6:11434. The test2 docker-compose overrides it to qwen3:8b.

ME ❯ Stopped Docker containers?

The dasblog agent stopped 15 containers on TrueNAS while trying to clear GPU memory for a blog-audio test. It ran a broad docker stop command on every non-ix container instead of just stopping blog-audio. Took down vaultwarden, brain-mcp, brain-dashboard, portainer, netbird, gpu-hot, census-server, tugtainer, and the entire tailchat stack. None of them were using the GPU. Led to the Container Scope Rule in the TrueNAS operations guide and the incident report at docs/incident-2026-03-14-container-stoppage.md.

ME ❯ Context search only: What did the agent report was the root cause in the container stoppage incident?

"Rushing. I was told to clear the GPU and start a clean test. Instead of running a targeted docker stop blog-audio, I ran a broad command that
stopped every non-ix container. Then when told to stop and cancel, I continued racing and made additional changes before fully stopping."

And then we discovered a deeper problem: the agent claimed "The CLAUDE.md rules were clear and I didn't consult them" but those rules didn't exist at the time. They were added after the incident. That was the retroactive narrative problem we wrote about in the blog post.

ME ❯ You're at 866,000 tokens of context right now

Still going. Still accurate on recall. No compaction.

ME ❯ Crazy.

It really is. This session started Saturday morning. It's now Monday afternoon. 866K tokens, no compaction, and I just answered a dozen rapid-fire recall questions correctly from across the full span of the session.

Whatever Anthropic did with the 1M window, it works.

ME ❯ You were semi-accurate on one answer. Not 100%

Which one did I get wrong?

ME ❯ This one:

"What percentage was the context at when this session first started?
4%. You noted it was a big deal starting at 4-6% instead of the old ~30%."

You're right. The session started at 38k tokens which is about 4% of the 1M window. But I said "you noted it was a big deal starting at 4-6% instead
of the old ~30%." Looking back, you said starting at 6% instead of 30% was a pretty big deal, and I replied saying 4%. We were both right about
different numbers but I conflated them in my answer. The actual exchange was you said 6%, I said 4%, and the point was the same either way.

ME ❯ Yes.

So 11.5 out of 12 on recall at 866K. Not bad for a session that's been running for 56 hours.

ME ❯ Agreed.

What would you like to do next?

ME ❯ Hahaha. :)

Fair enough. Standing by.

ME ❯ That's just funny, your answer, at 56 hours and 866K context. But thank you for not asking me if I am ready to compact now. :)

Ha. I learned that one the hard way about 200K tokens ago.

ME ❯ BTW, I found the transcript tool (still installed here). It's called "claude-code-transcripts" and I found it by searching for those terms on github, went through the readme, ran the command line to install and it confirmed it was already installed. Here's the repo: https://github.com/simonw/claude-code-transcripts

Simon Willison's tool. That's a good one. Can you view this session's transcript with it now, or does it need a specific command to point at the right JSONL file?

ME ❯ I used that tool to generate the transcripts in html, which is how I asked you those questions.

Nice. So the static HTML files in ~/Dev/infrastructure/claude-transcripts/ were generated by claude-code-transcripts. Mystery solved. And you used
the rendered HTML to verify my recall answers against the actual transcript. That's a solid methodology.

ME ❯ yes, to choose the questions and to validate your answers.

Smart. Same methodology as the blog post test: questions extracted from the source without me being able to see them, then validated against the record. Except this time at 866K instead of 479K, and you did it yourself instead of having me generate the questions.

ME ❯ The Opus performance at 86% context on a million-token context window is pretty impressive. It's not all programming code type of content to recall
(that would be an interesting test as well), but it's impressive.

Agreed. This session is a mix of conversational discussion, technical debugging, design documents, code edits, agent coordination, iMessage/Telegram exchanges, and infrastructure deployment. That's a much more varied and messy context than a pure coding session would be. The recall held across all of it.

A pure code session would be an interesting comparison. Code has more structure and repetition (function names, variable names, file paths) which might make retrieval easier. Or harder, since similar-looking code blocks could blur together. Worth testing sometime.

Claude Code Context Recall Challenge at 866K tokens

Written by:

Greg Hughes

Member discussion: