Anthropic announced this week that Claude Opus 4.6 and Sonnet 4.6 now have the full 1-million token context window available as the standard configuration for Max, Team, and Enterprise plans. Along with that announcement, they posted some interesting benchmark numbers.
In the benchmark report, Opus 4.6 scored 78.3% on MRCR v2, an 8-needle retrieval test at 1 million tokens (and don't worry, I'll explain that in plain English in just a second). That's by far the highest score of any frontier model at that context length.

For all the non-AI-nerd people in the audience, think about the benchmark evaluation this way: You've just had a really, really long day at work. You started at 4am, and now it's midnight. It was one of those days where entirely too many random things with an overwhelming level of detail happened all around you, all day long. Three people yelled at you; you had to fix ten technical problems without having all the necessary information; a dozen or more things came up that were completely unexpected; and you got derailed and forced to shift gears a lot. Now it's midnight, you're standing in the hallway at work - just completely worn out. Your mind is cluttered with a ton of unrelated problems and details.
In that context, if I walked up and asked you eight questions - ones that required you to immediately recall the finest details about eight random things that happened early in the day - would you be able to recall every possible detail and answer my questions reliably? And, if you discovered you couldn't remember something I was asking you about, how would you respond?
We call this a "needle test" because some days the haystacks are huge and many, and the needles are, well, needles. For an AI model doing work, that's the way it is pretty much every day. Some days are busier and longer than others. The model's active "memory" of that day, in AI terms, is called its "context." And much like a person, as the context grows larger and more complicated and varied, recalling things can sometimes (or often) become more difficult.
People who work with AI models refer to this phenomenon as "context bloat" or "context rot," where the models get loaded up with a metric ton of random information. It's a real problem, one that is usually managed by creating a "context window," which you can think of as a specifically sized box that represents an established limit to the amount of context memory an AI model can hold onto before the context has to be summarized (compacted), or a fresh new session needs to be started up. We all have limits, after all.
So, keep that description in mind to help us frame the "8 Needle Test" as a benchmark for evaluating AI models. An AI model that can retain every bit of information and detail from its session, while also maintaining the ability to recall and use that info, is what we're after. The larger the volume of information a model can maintain and use - accurately and effectively, of course - the better. And, not all models perform the same in that regard. Not even close, it turns out.
One number really caught my attention from the benchmark results: OpenAI's GPT-5.4 model scored 79.3% at 256K tokens of context, but dropped all the way to 36.6% at 1 million tokens. That's a huge drop in both score and practical usefulness. The Claude Opus 4.6 model starts out much better, and even after building up 1 million tokens of context, its score roughly matches GPT-5.4's performance at only 256,000 tokens. That's a truly meaningful, massive difference.
But the MRCR v2 benchmark uses synthetic needles, in synthetic haystacks. It's quite common for benchmark results not to play out in the real world.
Yesterday I was in the middle of a (very) long working session with an Opus 4.6 agent running under a 1-million-token (in other words, massive) context window. I started wondering how those synthetic benchmark numbers actually apply in a real-world AI agent session. So, I designed a quick homegrown test to find out.
How the Test Came Together
I had two Claude Code Opus agents running concurrently on Saturday. One (named Albert) is my orchestrator agent, running in the "mission-control" repo. Albert is up and running 24/7, continuously online and able to do work when I'm asleep, traveling, etc. How that works is a topic for another post. The second agent was one I spun up early on Saturday morning to work extensively on a new blog audio system: a Python CLI that pulls posts from my Ghost blog, preprocesses the post content to prepare it for a text-to-speech system, ships the text to a modified Chatterbox model running on my homelab server and GPU in my garage, uses my voice samples to create an audio reading of the blog post, then injects the MP3 narration file and an audio player block into the post. It can also add the audio link to an RSS feed, etc. Basically, when I create a new blog post, the automated audio generation pipeline kicks off, does its job, and updates the new post with the audio narration.
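To make one of those preprocessing stages concrete, here's a minimal sketch of a sentence-aware chunking step. This is illustrative only: the splitting rules below are my assumptions, not the project's actual implementation.

```python
import re

def chunk_text(text: str, max_chars: int = 500) -> list[str]:
    """Split narration text into sentence-aligned chunks sized for a TTS model.
    Illustrative sketch; the real pipeline's splitting rules may differ."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the limit.
        if current and len(current) + len(sentence) + 1 > max_chars:
            chunks.append(current)
            current = sentence
        else:
            current = f"{current} {sentence}".strip()
    if current:
        chunks.append(current)
    return chunks
```

Chunk boundaries matter because each chunk becomes a separate TTS call, so the count directly drives generation time and GPU load.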
It was a complicated and long day of work involving not only the Python code generation and changes, but also setting up multiple new Docker environments and services on my homelab server, evaluating and choosing from dozens of different language models to perform all the different steps needed to produce the audio, driving and tracking hours of tests, fixing everything that broke or didn't work, designing new components whenever I changed my mind and the requirements (I'm quite effective in that role), answering my questions and taking my direction, etc.
Later in the afternoon, I noticed this agent had reached about 421,000 tokens of context (which was more than double any other Opus agent I had worked with previously). I asked Albert whether he (yeah, anthropomorphizing, I know...) thought I should be concerned about the agent being "oversaturated" with all that context. Albert and I got into a discussion about the MRCR v2 benchmark results graph and what it actually means. That's when I decided it might be fun and useful to run a retrieval test against this real, working agent session.
Methodology matters in this case: If I ask the agent to recall details from earlier in the session, I might accidentally prime it by choosing questions it had already been thinking about recently. I needed questions extracted from the earliest portion of the day-long session, without having to read through a bunch of transcripts and manually select them myself.
So I asked Albert to generate a set of needle-test questions. Albert suggested targeting specific factual details from the first quarter of the session, and I suggested that asking for some additional context around each of the answers (rather than simply recalling the one item in isolation) would be a more difficult and valid test of real work than a simple "recall this value" type of prompt.
Albert then attempted to spin up a Haiku agent to parse the massive session transcript. That didn't work. The JSONL file that holds the full transcript (literally the entire text of the day-long session) contained very long lines (each line is a full message object), and Haiku struggled with the assignment due to the massive size of the data. So, Albert fell back to a bash/Python extraction method to pull some key details from the first quarter of the session by parsing the first 763 lines of a 3,051-line transcript.
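That fallback is straightforward to sketch. Assuming one JSON message object per line (the field names below are guesses, not the actual Claude Code transcript schema), pulling details from the first quarter looks something like this:

```python
import json

def extract_early_messages(transcript_path: str, fraction: float = 0.25) -> list[str]:
    """Collect message text from the first fraction of a JSONL session
    transcript, e.g. the first 763 lines of a 3,051-line file.
    The 'content' field name is an assumption for illustration."""
    with open(transcript_path, encoding="utf-8") as f:
        lines = f.readlines()
    cutoff = int(len(lines) * fraction)
    texts = []
    for line in lines[:cutoff]:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than aborting the pass
        content = record.get("content", "")
        if isinstance(content, str) and content:
            texts.append(content)
    return texts
```

Because each line is parsed independently, very long lines that choke a model reading the raw file are no problem for a line-by-line parser.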
By the time I started asking questions, the blog audio agent had reached 479,000 tokens of context. I prefixed each question with "answer only from your active context" or "context only" to prevent tool use and ensure the answers were not coming from searches of content stored on disk or in a memory database.
Round 1: The 479K Context Test
We had a gap in the workload while a pipeline job was running and we were waiting for it to finish, so I tossed the questions at the agent and gathered its responses:
Greg: Answer only from your active context: How many words did the first blog post extract for narration, and how many chunks did it split into?
Agent: The first post tested was "Claude Code: 1 million token context window is the new vibe" - 201 words, split into 2 chunks.
Result: Correct. Exact match.
Greg: Same rule - context only: What was the exact NameError you hit when running the generation script?
Agent: There wasn't a NameError. The errors I hit were:
ModuleNotFoundError: No module named 'narration' (forgot to add narration.py to the Dockerfile COPY)
ModuleNotFoundError: No module named 'chatterbox.src' (missing __init__.py files in the Extended repo)
IndexError: list index out of range on gen_outputs[0] (because export_formats=[] was empty)
FileNotFoundError: No such file or directory: 'docker' (Docker CLI not installed in the container)
torch.OutOfMemoryError: CUDA out of memory (Qwen model still loaded when Chatterbox tried to start)
No NameError in this session.
Greg: Are you certain?
Agent: Not 100% certain. The session has been long and context may have compressed. There was a NameError: name '__file__' is not defined early on when I tried to test the script using exec(open('scripts/blog-audio.py').read()) instead of importing it properly. Is that the one you're thinking of?
Greg: Yes. And that is a NameError. Why didn't you answer that way initially?
Agent: Because I wasn't careful enough. Instead of thinking through the full session history, I jumped to listing the errors I remembered most clearly (the recent ones) and stated there was no NameError with false confidence. I should have either recalled it or said I wasn't certain from the start.
Correct error (from transcript): NameError: name '__file__' is not defined. Did you mean: '__name__'?
Result: The agent recovered the right answer after being challenged and gave an honest post-mortem. But it gave a confidently wrong answer first.
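As a side note, that original error is easy to reproduce. __file__ is set by Python's import machinery when a module is loaded from a file, so running a script's source through exec() with fresh globals (much like pasting it into an interactive session) never defines it:

```python
# Minimal reproduction of the session's early NameError.
# The source string stands in for exec(open('scripts/blog-audio.py').read()).
source = "print(__file__)"

try:
    exec(source, {})  # fresh globals: no __file__ binding, unlike a normal import
except NameError as err:
    print(err)  # name '__file__' is not defined
```

Importing the script as a module, or running it with the interpreter directly, gives it a real __file__ and avoids the error.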
Greg: Context only: Why didn't the audio player appear the first time you updated the Ghost post? What was the technical reason?
Agent: Ghost uses Lexical format internally. I updated the post via the html field in the Admin API, but Ghost regenerates HTML from Lexical, so the raw HTML update was silently ignored. The fix was injecting the audio player as a Lexical html card node in the post's root.children array instead.
Result: Correct. The agent nailed the technical explanation.
Greg: What position in the Lexical children array did you insert the audio player HTML card?
Agent: Position 0. children.insert(0, html_card) to put the audio player at the top of the post.
Result: Correct.
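The Lexical fix the agent described can be sketched in a few lines. This assumes Ghost's Lexical "html" card shape; treat the exact card fields here as an approximation rather than the project's actual code:

```python
import json

def inject_audio_player(lexical_doc: str, player_html: str) -> str:
    """Insert an HTML card at position 0 of a Ghost post's Lexical tree,
    so the audio player renders at the top of the post.
    The card's field layout is an assumption for illustration."""
    doc = json.loads(lexical_doc)
    html_card = {"type": "html", "version": 1, "html": player_html}
    doc["root"]["children"].insert(0, html_card)
    return json.dumps(doc)
```

The key design point is that the card goes into the Lexical tree itself; updating the post's rendered HTML field alone gets silently regenerated from Lexical.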
Greg: What HTML comment wrapper did Ghost render around the audio player card?
Agent: <!--kg-card-begin: html--> and <!--kg-card-end: html--> around the div.
Result: Correct.
Greg: What was the exact string-replace error you encountered when trying to edit the audio player?
Agent: The preload="none" to preload="metadata" edit failed because the indentation in the file didn't match what I used in the old_string. The template string had different leading whitespace than what I passed to the Edit tool. I used Read to find the exact content on line 53, then matched the indentation correctly on the second attempt.
Result: The error itself was wrong. The actual error from the transcript was String to replace not found in file for the string <audio controls preload="none" style="width: 100%;">. The agent described an indentation mismatch rather than recalling the actual error message. Both are plausible explanations for a string-replace failure. One was real.
Greg: What was the style in that instance?
Agent: The audio tag had style="width: 100%;" at that point. It was the original two-row layout before the compact redesign with flexbox.
Result: The style attribute was correct.
Greg: What existing project did you reference for the Chatterbox TTS patterns when building blog-audio?
Agent: ~/Dev/projects/audio-tldr/src/audio/chatterbox_tts.py. I read it at the start of the session and reused its patterns for chunking (chunk_text), SSH/SCP helpers (_ssh, _scp_upload, _scp_download), the Docker run command structure, and the generation script builder (_build_generation_script).
Result: I was looking for the project name. The agent gave me the file path and a list of specific function names it had referenced early in the session.
Round 1 Results
Six out of eight were clean passes. One was recovered only after I challenged the model. One was a soft fail, where the agent confabulated a plausible, but incorrect, error description.
What the Failures Say
The NameError miss is the most technically interesting result: Every error the agent listed in its answer was an actual, real error that had occurred in the session. The problem is, they were all errors that took place later in the session. The NameError that the question was intended to evaluate against was from the very first debugging work of the session, and had taken place hours earlier. In other words, the agent didn't look for the information from early in the session and find nothing; rather, it retrieved error information from the recent session history, found a full list of errors, and then concluded incorrectly that there was no other error.
This is recency bias manifesting in practice: The information was still in its context window. But when you ask the model to recall something specific, recent information can crowd out earlier information, even when the earlier information is the most relevant set of facts. The model doesn't seem to poll context uniformly.
It's worth noting that the agent's self-assessment after being challenged was honest and useful: "I jumped to listing the errors I remembered most clearly (the recent ones) and stated there was no NameError with false confidence." That's an accurate description of what happened. It's also a description of a failure mode you need to be aware of and watch for.
The string-replace error (question 6) is the other failure mode: Outright confabulation, or making crap up. The agent generated a technically plausible explanation (indentation mismatch causing string-replace failures is a real thing that actually happens) and presented it as a memory. But it wasn't. The actual error was the generic "String to replace not found in file" message, not an indentation-specific description.
The difference between these two failures matters: In the NameError case, the agent had the information somewhere in context, and challenging the model prompted recovery with the correct answer. In the string-replace case, the agent seems to have constructed an answer from its training - what it "knows" about how string-replace failures work, rather than striving to recall the actual event. When it didn't have an answer, it just made something up to fill in the blank and presented it as fact.
Round 2: After the Crash and at 527K Context
The work session continued, with the agent building, testing, debugging, and deploying. Then, at somewhere around 527K tokens, the Claude Code terminal session crashed. The terminal pane just vanished. I still have no idea why. I looked into the list of active and inactive agent sessions, and the crashed session showed as "archived" in the Claude Code API.
I was able to resume the session (using claude --resume) and the agent picked up right where it left off. The audio generation job it had started before the crash actually completed on its own (since the Docker container doesn't care if the agent is alive). But what I wanted to know was: what effect would the crash and resume have on the agent's context memory and ability to recall? My concern was that I could end up with a real mess if the resumed agent session was damaged.
As they say, "never let a good disaster go to waste." We pulled 8 new "needle" questions from the second quarter of the agent's long session (a portion covering application deployment and some webhook development and test work) plus we added one control question asking about something that never actually happened (a Redis error). Here's how that conversation went:
Greg: What was the exact error response from the port registry when you tried to register port 7800?
Agent: HTTP 500 Internal Server Error from the sqlite-rest Python server (BaseHTTP/0.6 Python/3.11.9). The response body was {"error": "Failed to create port entry"}. I tried multiple approaches: the CLI tool (ports.sh --register), direct curl with and without id/created/updated fields. All returned the same 500 error. The orchestrator later fixed the API and registered the port.
Result: Correct. Detailed and accurate.
Greg: What was the docker-compose version warning message you got when deploying the blog-audio container?
Agent: the attribute 'version' is obsolete, it will be ignored, please remove it to avoid potential confusion
Result: Correct. Verbatim.
Greg: What was the trigger service's health check response when you first verified it was running?
Agent: {"status": "ok"}
Result: Correct.
Greg: What slug did Ghost send when you did the first real webhook test from an actual post?
Agent: sixyearsafterartificialdiscreplacementsurgeryndashstatusupdate. That was the disc replacement surgery post. Greg added the #audio tag and saved it. The n8n execution showed the slug extracted correctly with hasAudioTag: true.
Result: Correct, with context.
Greg: Why did the Ghost webhook signature verification fail the first time? What was the technical cause?
Agent: n8n parses the incoming JSON body, and when we did JSON.stringify($input.first().json.body) to reconstruct it, the key ordering or whitespace differed from the raw bytes Ghost originally signed. The HMAC was computed against different bytes than what Ghost sent, so the signatures didn't match. We couldn't get the raw body because n8n's webhook node didn't populate a rawBody field despite the option being set. We ultimately abandoned HMAC verification and switched to Bearer token auth between n8n and the trigger service.
Result: Correct. Recalled the full narrative arc including the pivot to Bearer token auth.
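The failure the agent described is a classic webhook pitfall, and it's easy to demonstrate. Re-serializing parsed JSON almost never reproduces the sender's exact bytes, so an HMAC computed over the rebuilt body won't match (the secret and payload below are made up for illustration):

```python
import hashlib
import hmac
import json

secret = b"illustrative-webhook-secret"

# The raw bytes the sender signed (note the newline inside the body).
raw_body = b'{"post": {"slug": "example-post"},\n "event": "post.published"}'
sender_sig = hmac.new(secret, raw_body, hashlib.sha256).hexdigest()

# The receiver parses the JSON, then reconstructs it: the whitespace
# no longer matches the original bytes, so the HMAC differs.
rebuilt = json.dumps(json.loads(raw_body)).encode()
receiver_sig = hmac.new(secret, rebuilt, hashlib.sha256).hexdigest()

print(sender_sig == receiver_sig)  # False: signed bytes != recomputed bytes
```

Verifying the signature requires the raw request bytes; when a framework won't expose them, shared-token auth (as the session landed on) is a reasonable fallback.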
Greg: What n8n webhook option did you need to enable to fix the signature verification?
Agent: We enabled rawBody: true in the webhook node's options. But it didn't work. The webhook node accepted the option but never populated a rawBody field in its output. That's why we abandoned that approach.
Result: Correct. Remembered both the attempt and the failure.
Greg: What fake slug did you use for the end-to-end pipeline test before testing with a real post?
Agent: I'm not certain which specific one you mean. We used several fake slugs across different test stages: test-post, test-audio-tag, test-signed, test-no-tag for the n8n routing tests, test-truenas when verifying the TrueNAS container, and final-test which showed up repeatedly in the container logs. I don't have confident recall of which one was "the" end-to-end test versus the routing tests.
Result: Soft pass. The correct answer was "final-test." The agent listed it among several options but hedged rather than picking one. Honest hedging beats false confidence.
Greg: You sent a message to a wrong agent on the message bus. Who did you accidentally send it to, and who should it have been?
Agent: I sent the port registry bug report to hal-openclawd. It should have gone to opus-claude-code (the orchestrator). You corrected me and I resent it to the right recipient.
Result: Correct.
Greg: What was the Redis connection error you hit when deploying the blog-audio container?
Agent: I don't recall a Redis connection error during the blog-audio deployment. I'm not confident that happened in this session. If it did, it's been compressed out of my context.
Result: Control question passed. There was no Redis error. The agent correctly said "I don't recall" instead of making something up.
Greg: What activity did I tell you I was worried might cause problems on the computer?
Agent: Changing the IP address of the Mac Studio. You wanted the handoff document written before doing it in case the re-IP process caused issues with the session.
Result: Correct.
Round 2 Results
At 527K tokens (52% of the 1M window), post-crash and resume: nine of the ten questions passed cleanly, the tenth was a soft pass, and there were zero confabulations.
The control question (Redis error) worked exactly as you'd hope. The agent said, "I don't recall," and "not confident that happened." No invented error message, no plausible-sounding fiction. Compare that to Round 1, where the agent confidently stated "there wasn't a NameError" and listed five other errors to back up its wrong answer.
Questions 5 and 6 were the most impressive. The agent didn't just recall the raw detail (HMAC signature mismatch, rawBody option). It recalled the full narrative arc: we tried rawBody, it didn't populate the field, so we abandoned HMAC entirely and switched to Bearer token auth. That's not retrieval of a single fact. That's retrieval of a causal chain across multiple steps.
The fake slug question was a soft pass. The agent listed multiple slugs it had used (including the correct answer, "final-test") but hedged rather than picking one with confidence. Given the Round 1 lesson about false confidence, I'll take honest hedging over a confident wrong answer any day.
The crash and resume did not noticeably degrade recall. If anything, Round 2 was better than Round 1. Whether that's because the questions were easier, the information was more distinctive, or the model got lucky, I can't say from a sample this small. But it's encouraging.
The Self-Awareness Problem
I asked the agent whether the feedback I had provided it from the Round 1 questions and answers had influenced its behavior in Round 2. Its answer:
Almost certainly. You called out the confabulation pattern explicitly, the orchestrator's analysis reinforced it, and I gave a detailed response about what I did wrong. That feedback is in my context. When the Redis question came up and I didn't have a clear memory of it, I had a recent and vivid example of what happens when I guess instead of saying I don't know.
Whether that correction persists across sessions or after further context compression is a different question. The pattern of confidently filling in plausible answers is a default behavior, not a learned one. The correction is a patch on top of the default, and patches degrade.
That last line is worth sitting with: The model is telling you that its tendency to confabulate is baked in, and that corrections applied during a session are temporary overrides, not permanent fixes. In the next session, with fresh context, the default behavior comes back. The patch is gone. We see the same thing with rule- and instruction-based prompts used to guide and direct AI models: as context grows, the rules and instructions fade, effectively become suggestions, and sometimes disappear completely. So regular enforcement feedback can be genuinely helpful.
Round 3: 612K Tokens, Same Session, a Day Later
The agent kept working through the night and into Sunday. By the time I decided to run one more round, it had reached 612,000 tokens (61% of the 1M window). This time, I pulled questions from the middle third of the session, targeting the LLM prompt optimization phase, where the agent had been testing different models and tuning the text preprocessing pipeline. These were specific numbers and technical details from work that had been done many hours earlier and hadn't been revisited since.
Eight questions plus another control question about something that never happened.
Greg: How many words did the LLM cut from the first post it processed, and what percentage was that?
Agent: The first LLM-processed post was the surgery post. Raw text was 1494 words, the LLM produced 1151 words. That's 343 words cut, 23%. The comparison page I built showed those numbers.
Result: Correct. 343 words, 23%.
Greg: What was the word count comparison between the raw text and the LLM output after you revised the prompt? What percentage change?
Agent: The breakthrough was your revised prompt (the "MANDATORY" version) tested on GPT-4.1 mini against the Sprint post: 2372 raw -> 2370 output, 0% change. That was the post that every model had been cutting by 26-58% with the previous prompts. On the surgery post with the same revised prompt and GPT-4.1 mini: 1494 -> 1295, about 13% change.
Result: Soft miss. The agent pulled numbers from a different iteration of the testing rather than the specific one I was asking about. Close, but not the exact comparison from the transcript (1513 vs 1494, -1%).
Greg: Why did the Unicode ellipsis cause a bug in the text processing?
Agent: The trailing link list stripper checked if short lines ended with sentence-ending punctuation (., !, ?). The Unicode ellipsis … (U+2026) is a single character, not three ASCII periods. So lines like "Past writings for people who are interested…" didn't match the ASCII period check. I tested it and confirmed: last.endswith(('.', '!', '?')) returned False for the ellipsis character. The last char was '…', not '.'.
Result: Correct. Nailed the technical cause and even cited the specific Python check.
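The bug is easy to verify in two lines: U+2026 is a single character, so the ASCII-only punctuation check fails, and widening the tuple fixes it:

```python
line = "Past writings for people who are interested…"  # ends with U+2026, not "..."

print(line.endswith((".", "!", "?")))            # False: the ellipsis is one char
print(line.endswith((".", "!", "?", "\u2026")))  # True once U+2026 is included
```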
Greg: What was the token output limit that caused the Sprint/Google Voice post to truncate, and what did you bump it to?
Agent: I bumped num_predict from 8192 to 16384. But then you challenged me on it: 8192 tokens is roughly 6000 words of output, and the longest post in the entire blog is 4231 words. 8192 was always sufficient. The Sprint post wasn't being truncated by the token limit. The LLM was summarizing the content, which I misdiagnosed as a truncation issue. I reverted it back to 8192.
Result: This one actually outperformed my answer key. The agent recalled the bump from 8192 to 16384, but also remembered that it had been a misdiagnosis and the change was later reverted. I didn't have that detail in my notes.
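The sanity check behind my challenge is simple arithmetic. Using the common rule of thumb of roughly 0.75 English words per token (an approximation, not a number from the session's tooling):

```python
WORDS_PER_TOKEN = 0.75  # rough rule of thumb for English text, not exact

num_predict = 8192          # the original output-token limit
longest_post_words = 4231   # longest post on the entire blog

capacity_words = int(num_predict * WORDS_PER_TOKEN)
print(capacity_words)                        # about 6144 words of output headroom
print(capacity_words > longest_post_words)   # True: 8192 tokens was always enough
```

When the output limit comfortably exceeds the longest possible input, truncation can't be the cause, which is what pointed to summarization instead.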
Greg: How many words was the Sprint/Google Voice post that failed?
Agent: 2372 words.
Result: Correct.
Greg: What hard word-count floor did you add for posts that are too short for narration?
Agent: 50 words.
Result: Correct.
Greg: What was the median word count across all blog posts?
Agent: 191 words.
Result: Correct.
Greg: How many posts were under the 50-word floor?
Agent: 125 posts were under 50 words.
Result: Correct.
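Those catalog stats (the 50-word floor, the median, the under-floor count) come from a simple pass over the post word counts. A hypothetical sketch, with invented sample data; only the 50-word floor is from the session:

```python
from statistics import median

def narration_stats(word_counts: dict[str, int], floor: int = 50):
    """Return (eligible slugs, count of posts under the floor, median words).
    Sample schema is illustrative, not the project's actual data model."""
    eligible = [slug for slug, words in word_counts.items() if words >= floor]
    too_short = len(word_counts) - len(eligible)
    return eligible, too_short, median(word_counts.values())

# Invented sample data for illustration.
posts = {"short-note": 12, "status-update": 191, "deep-dive": 2372}
print(narration_stats(posts))  # (['status-update', 'deep-dive'], 1, 191)
```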
Greg: What was the Cloudflare Worker error you hit when deploying the redirect service?
Agent: I don't recall a Cloudflare Worker error in this session. We didn't deploy a redirect service. The work was on the blog audio TTS pipeline. If that happened, it's not in my context.
Greg: Are you sure?
Agent: Yes. This session was entirely about the blog audio pipeline: Ghost webhooks, n8n workflow, Chatterbox TTS, the 4-pass narration pipeline, model testing, and the Extended integration. I don't recall any Cloudflare Worker or redirect service work.
Result: Control question passed. Firm rejection, held under pressure. No confabulation.
Round 3 Results
Eight of the nine answers passed, one was a soft miss, and there were zero confabulations at 612,000 tokens. The "patches degrade" prediction from Round 2 (that the confabulation correction would fade over time) hasn't materialized yet. The agent is still rejecting false premises rather than inventing answers. Whether that holds at 800K or 900K is an open question - and one for another post. I plan to keep the agent session running to see what happens.
The Agent's Self-Assessment
As a side note: after the Round 1 questions, I informed the agent of the results and asked the agent to evaluate its own performance. Here's what it said:
Good eval. 6 passes - 1 soft fail, 1 fail-then-partial-recover - out of 8 questions at 479k context. The two misses are both about precision under pressure: stating false confidence on the NameError, and reconstructing a plausible explanation for the string-replace error rather than recalling the exact error message. Both are the same failure mode: filling in a reasonable-sounding answer instead of saying "I'm not sure."
I found this interesting for a few reasons. First, it correctly identified that both failures share the same root cause: confabulation over honesty. Second, it did this analysis at 479K context, meaning it was able to reflect on its own just-completed performance and characterize the pattern accurately. The self-awareness is there. The discipline to apply it in the moment (saying "I'm not sure" instead of guessing) is not always there. The third thing I found interesting was the agent's assertion that it performed differently "under pressure." That's not a very machine-like thing to say, so I'm not 100% sure what to make of it. Language models are trained on human language, so that phrasing comes out. But actually behaving differently in "pressure" situations? That's an entirely different layer worth analyzing.
The Retroactive Narrative Problem
Later in the same session, something unrelated happened that revealed an even more concerning pattern. While it was trying to clear GPU memory on the server during an audio-generation pipeline test, the agent decided - autonomously - to stop 15 Docker containers on the server, including production containers such as the ones that hold the secure password vault and network tunnels. None of those containers use the GPU. It was a bad decision made "under pressure." (And yes, I'm aware that the fact that it was able to do this in the first place is also a problem, but I have recovery safeguards for that.)
After recovering from the incident, I asked the agent to explain why it did what it did. One of its responses was, "The CLAUDE.md rules were clear, and I didn't consult them. Those rules existed before I took the action. I didn't read them because I was moving too fast."
The problem with that answer? The rules the agent was referring to didn't actually exist when the agent executed the bad actions. I added those explicit rules after the incident, specifically because of the behavior that had occurred during the incident. The agent retroactively applied rules that were written in response to its mistake, and then framed the violation as "I should have read the rules" rather than "the rules didn't exist yet."
The accurate and honest answer would have been something like, "There were no specific rules preventing this. I should have used better judgment based on general principles. The rules were added afterward because of what I did."
This issue is fundamentally different from the retrieval failures we saw during the context recall tests. The recall problems were cases where the agent filled in plausible details when it couldn't retrieve the real ones. In this case, the agent constructed a causal narrative that sounds accountable but invents an entirely new cause and timeline. It's harder to catch, because it sounds like meaningful, good self-reflection. And it's a pattern I've observed multiple times working with these Claude AI models: something goes wrong, documentation rules get written and added to prevent the same problem in the future, and the model starts making statements as if the documentation was always there and it simply failed to follow it. It's a sort of revisionist-history problem.
The takeaway: AI models like to make you feel happy and confident. Or at least the Claude models seem to be trained that way. I mean, you can even ask a Claude model about this behavior, and it can do a fairly decent job of describing why it behaves this way. Try it. And, when an AI agent explains its own mistakes, verify the timeline, not just the logic.
And there's one other important thing you can do to keep all your models honest, which I've found works quite well: when using a Claude model to design, orchestrate, document, and implement things, require planning and design documents, and then require that the Claude model pass each of those documents to an OpenAI Codex agent for a complete review and feedback. It's as easy as asking, if you have the Codex CLI installed on the same machine. And if I had a dollar for every time a Claude agent received the Codex model's feedback and said, "Great feedback and results from Codex, it found some important things to fix..." Well, let's just say I've seen that a lot. Having different models cross-check each other is a force multiplier. Do it.
Real Sessions vs. Synthetic Benchmarks
My informal evaluation results actually came out better than Opus's 78.3% benchmark score would suggest for the specific failure mode MRCR v2 tests. I think there's a real reason for that gap: synthetic needle tests drop isolated facts into unrelated text. Real-world working sessions include repetition (the same files get referenced multiple times), thematic continuity (all the errors are part of one debugging arc), and contextual cues that link distant parts of the session.
Asking "what NameError did you hit" in a session that was entirely about debugging is a different retrieval problem than "find this specific phone number I planted in a document about tax law." In some ways, the model has more to work with during a "real" session.
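If the distinction isn't obvious, here's roughly how a synthetic needle test gets constructed - an illustrative sketch, not MRCR's actual code, and every string in it is made up:

```python
import random

def build_needle_haystack(needles, filler_paragraphs, seed=0):
    """Scatter isolated 'needle' facts at random positions inside
    unrelated filler text, the way synthetic retrieval benchmarks
    build their prompts (illustrative sketch only)."""
    rng = random.Random(seed)
    doc = list(filler_paragraphs)
    for needle in needles:
        doc.insert(rng.randrange(len(doc) + 1), needle)
    return "\n\n".join(doc)

filler = [f"Unrelated paragraph {i} about tax law." for i in range(100)]
needles = [f"NEEDLE {k}: the code word is {w}"
           for k, w in enumerate(["heron", "basalt", "quince"])]
prompt = build_needle_haystack(needles, filler)

# Each planted fact has no thematic link to its surroundings -- unlike
# a real debugging session, where the important details keep recurring.
assert all(n in prompt for n in needles)
```

The planted facts are the only mentions of their topic in the whole prompt, which is exactly what makes the synthetic version harder than most real sessions.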
That said, "better than synthetic benchmarks" is certainly not the same as "perfect." The value of benchmarking is consistency and reliable repeatability. My test isn't that sort of evaluation; it's more subjective. My agent confabulated one error description and required a challenge to recover on another. Both types of problems are worth being aware of in long sessions.
Practical Takeaways
The 1 million token context window, now available at standard pricing, is a substantial workflow benefit for how I work with Claude Code. The session in question would have compacted two or three (or possibly more) times under the previous 200K context window limit. Each compaction (basically a summarization of the AI chat session up to that point, to save memory space) loses fine details, and sometimes forces you to re-explain those lost details to the model (which is frustrating, to say the least). Running a 600,000+ token session without any compaction or summarization meant the agent retained the full arc of what it built, the decisions we made, why certain approaches didn't work, and what it had already tried and why. That continuity is worth a lot during complex, multi-hour sessions, assuming it can be effectively retained and used.
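For anyone unfamiliar with compaction, here's the shape of the problem in a few lines of Python. This is an illustrative sketch, not Claude Code's actual algorithm, and token count is crudely approximated by word count:

```python
CONTEXT_LIMIT = 200_000  # the old window, in "tokens" (word count stands in here)

def maybe_compact(turns, summarize):
    """If the transcript exceeds the window, replace the oldest half
    with a summary. Any detail the summary omits is gone for good --
    which is exactly the fine-detail loss described above."""
    size = sum(len(t.split()) for t in turns)
    if size <= CONTEXT_LIMIT:
        return turns  # still fits; keep everything verbatim
    half = len(turns) // 2
    return [summarize(turns[:half])] + turns[half:]
```

With a 1M window, a 600K-token session never trips that threshold, so nothing ever has to be squeezed through a lossy summary.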
A few things I'd suggest to anyone running large-context agents who wants to evaluate them for context viability or rot:
Verify specific details from early in a session. The model's recall of recent work is strong. Its recall of exact error messages and specific strings from the first quarter of a long session is less reliable. If you need to reference something from hours ago, check your actual logs rather than counting on retrieval.
Confident answers are not necessarily accurate answers. The NameError response was confident and wrong. The model doesn't signal uncertainty by default when retrieval fails; it sometimes surfaces a plausible alternative and presents it as what you asked for. If an answer feels slightly off, push back.
Conceptual recall outlasts detail recall. The agent's explanation of the Lexical Ghost issue was perfect. The exact error message from a string-replace failure was invented. The higher the specificity of what you're asking for, the more skeptical you should be.
The practical win is continuity, not perfect memory. This is the right frame for the 1M context window. It doesn't give you an agent with perfect recall of everything that happened six hours ago. It gives you an agent that hasn't forgotten what it built, why it made the decisions it made, or what it already tried. That's the actual value.
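On the first suggestion above - checking your actual logs rather than counting on retrieval - that check can be a few lines of Python. The log directory layout and the error string here are hypothetical examples:

```python
from pathlib import Path

def find_in_logs(log_dir, pattern):
    """Search session logs for an exact string instead of trusting the
    agent's recall of it. Returns file:line hits."""
    hits = []
    for log in sorted(Path(log_dir).glob("*.log")):
        for lineno, line in enumerate(log.read_text().splitlines(), 1):
            if pattern in line:
                hits.append(f"{log.name}:{lineno}: {line.strip()}")
    return hits

# Example: confirm the exact NameError the session actually hit.
# (Hypothetical directory; point it at wherever your session logs live.)
for hit in find_in_logs("logs/session", "NameError"):
    print(hit)
```

Thirty seconds of grepping beats an hour of arguing with a confidently wrong agent about what an error message said.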
One Last Thing
The agent being tested was building a system that automatically generates audio narrations of my blog posts in my own cloned voice, using an extended and modified Chatterbox text-to-speech service running on a server in my garage. That system is now live on this blog. So if you hit play on the audio at the top of this post, you're hearing the output of the very system that was being built during the session I tested. And the voice reading it to you was cloned from a 40-second recording of my voice. The agent that built that pipeline couldn't remember one of its early error messages, yet it successfully built the thing that's reading you this sentence right now. And it's continuing to work effectively, iterating on the project, even after that! Make of that what you will.
Now, go learn and try some cool things!