Don't trust large context windows

225 points171 comments12 hours ago

dofm

I guess I am mostly enjoying learning the fundamentals of AI stuff, even though I disagree with the direction it is going.

But I am struggling to put into words how alarming I find the comments on threads like this — all sorts of good-natured anecdotes about how XYZ works for them that are more like the suggestions in pet care or cookery threads on Facebook.

(Or worse still, like any Facebook 3D printing group: anyone who prints but wants to understand what is actually going on will know what I mean, I think)

Any shared sense of rigour is just completely torpedoed by the LLM world, particularly the cloud LLM world it seems, and we are reduced to cargo culting. Nobody is any more right or wrong than anyone else.

Have you tried cleaning your context with dawn dish soap, letting it dry and then adding a layer of glue stick?

ETA: I don't want to sound so mean about people who try to help, here or in facebook groups. I guess I just find these threads so different to threads on more or less any other topic, where someone's suggestion can be debated or refined by other commenters and then someone will explain a thing about how bash history selections work that will change your entire life. With these threads they devolve to "isn't it weird that threatening it works?"

show comments

bob1029

I've been able to avoid context size issues by applying one simple constraint to my agent loop. What I do is prevent all tool calling in the user's top-level conversation thread. Anything that needs to tool call must happen in a recursive invoke of the agent, which returns whatever results to caller.

I can keep the same high level conversation going for an entire day over a million LOC+ codebase without ever hitting meaningful token limits. No compaction or summarization tricks needed. I can burn 50 million tokens in recursive calls and still not touch 100k tokens in my root conversation thread.

There is some rework needed to "bootstrap" the agent each time it has to descend back into Narnia, but this is still far more efficient than carrying around one big flat context that tries to cover everything all the time.

Recursion is very effective at controlling token use, but it can only go so far. I've not observed any uplift for recursive depth beyond 1. I have seen the agent attempt it a few times, but the practical performance is simply not there. External symbolic recursion does not appear to be something the frontier models have been trained for. They are fantastic at emulating recursion in context, but we don't want that if we are trying to achieve a reduction in token use.

show comments

kelnos

This has not been my experience with Opus since Anthropic released the 1M token context window for use under the subscription plans. I routinely push past 500k tokens, even sometimes up to around 800k tokens, and don't see this problem. I've seen it to some extent when getting truly near the limit, up around and above 900k tokens, though what I see isn't as severe as the author seems to see.

(And I rarely fill the context window that far anyway when working on a single task, or a series of tasks that are related enough to warrant the same context; more typical is anywhere between 200k and 600k or so.)

I'm not saying that no one ever has this experience, but it's odd to me that some people see it so often that it warrants giving it a name.

show comments

SwellJoe

Opus in recent versions is fine beyond 100k, but I usually do try to keep it under 200k.

But, this is also why so-called "memory" systems are usually a mistake that make the models dumber. They don't have memory, they only have context, and every irrelevant fact you shove into the context is less context for the problem. Less distractions, better results.

The way to have the agent remember things is to have it document its work, like a human developer would do if they wanted their project to be friendly to other developers working on it. Good developer docs with an index page and a good plan with checklists, in concise Markdown files, checked in to the repo is the ideal memory for models and the ideal docs you need to figure out WTF the model has been up to. Helps with code review, too, whether by humans or another model. There's no down side.

show comments

lordgrenville

Almost every comment here is appealing to personal experience. By contrast, OP refers to two studies that compare performance on some kind of standardised test over a range of models.

Can't speak to how good those tests are, but they can't be worse than anecdotal evidence for something as vague/subjective as LLM performance.

show comments

kristianc

I'm getting a lot of mileage out of basically acting like the AI's Product Manager, and insisting that it writes up short PRDs for every feature we propose to build. That gives it a reference over time of everything that has been built, but also makes it less liable to drift with each one. Each one gets its own conversation. For me this is a happy medium between stopping it going off the rails but also making sure it can reference past decisions when it needs to. The one thing I dislike about Pocock's method (not to use PRDs so much but to have an in depth discussion to get alignment) first is it wastes a lot of the best window on that initial back and forth.

show comments

tmp10423288442

There’s a simple way to solve this: just use Codex. The auto-compaction is really good, and lets threads go on for a long time without losing track. In case you do notice a session is starting to go off track, it’s straightforward to make a new session, ask it to summarize an old session into an AGENTS.md, and start it from there.

schipperai

Working in the era of 200k context window meant I had to narrowly scope tasks to fit in the context window, forcing me to think about how to reduce complexity and naturally resulting in atomic work. 1M context windows and the promise that the latest models are "better at long running tasks" made me lazy in how I scope tasks and quality got worse. I now went back to narrow-scoping one session per task and zero compaction, trying not to go past 400k context window. If I end up with a long session, I was likely too ambitious and should have broken up the task.

Considerations about what goes on in agents internally will probably not be part of software development for long.

Personally, I already see LLMs and agents as blackboxes. I give each feature request to multiple LLMs and then compare the results. I don't manually use "sessions" at all. I just look at the outcome. When I dislike it, I "git reset --hard", change my prompts and restart the feature request.

To have an ongoing sense of which agents perform best, I keep a log and calculate an ELO score of which agents meet my demands best. This score is imporant to me, not so much how the agent achieves it.

show comments

wood_spirit

Yes context management is key.

I do my own framework and spend a lot of time trying to debug this and it’s not so much the context size in hard numbers but rather the probability that there is debris or wrong directions in the window that are drowning out the things the user thinks are important.

This manifests in the llm that keeps going back to doing the thing that failed when they tried it just before the last approach etc. The frequency of things in the context window give weight even if they are the wrong things.

I have a lot of tricks like not giving the llm lots of tools but rather giving it a tool it can use to search for tools etc.

But the bigger solution is in process where you use something like superpowers to force the llm through stages and you control the context that carries forward.

doginasuit

I think of the context window as a pot of soup that you add ingredients to between meals. If you have a relatively focused recipe and you are able to add only the ingredients you want, the soup stays good. If you or the agent add an ingredient that isn't fresh, it is going to be difficult to salvage and it is better to start over with a new pot.

It is not that agents can't function with a large context window, they can if that information generally has a desirable signal (like a large initial document or a well-focused session). Mistakes and the confusing signals that come out of fixing mistakes are why performance degrades. I start to trust the context window less not as a matter of size but the amount of friction we run into. The friction can be random but it is more often an issue with the path that I have us on.

show comments

WilcoKruijer

I built a very small personal extension for Pi [1] that gives me a /last command. It clears the entire session, only retaining the agent's last output message. This allows me to do manual "compaction". Basically I tell the agent something like "state the plan as discussed with references to files that should be edited", and call /last, then tell it to implement.

[1] https://pi.dev/

nuc1e0n

I doubt the dropoff is as large as 100k tokens. I start a new session and paste the best results from the previous one as soon as as LLM makes more than a couple of missteps. Theres too much focus on fixing what's wrong rather than going back to what worked and amending in a different way.

If you don't point out what's wrong I find the LLM will go into great technical detail which consumes a lot of tokens, but not 'see the wood for the trees'.

It seems to me human beings also have mechanisms to compact context, which may be why we can forget what we came into a room for when going through doorways. I think it would be interesting to research which markers we use to compartmentalize our thinking.

daishi55

> the dumb zone, where attention drops off and the model starts forgetting what you told it five minutes ago

I use opus 1m context all day every day at work and I simply have never encountered this. I don’t even think about context windows anymore I just let it do what it wants re compaction. Hard for me to understand where this article is coming from.

faeyanpiraat

I'm actually doing a big refactoring in a project where if everything gets loaded (code / docs), the context gets like 750k filled (Opus 4.8), and then the agent has the remaining ~200k to do actual coding, until I have to reset. I haven't finished the work but I'm like 80% there, and it seems the progress is good and the quality is also good, verified by doing some performance tests and a lot of comparisons between outputs between the original code and the new one.

Maybe I could achieve better and quicker results with keeping the context in the proper zone, but trying it will have to wait until the next project.

deliciousturkey

I dislike the non-specificity of "models" here. Different models have different attention architectures, and can therefore have significant differences in long-context behavior. It's true that long context is an issue can most models do drop off in quality, but I would not extrapolate behavior of old models to new ones.

monster_truck

This has not been my experience and I do not think any of the methodologies testing for this do so usefully.

_def

Funny to read about that superpowers repo, since only yesterday I wrote skills to do some markdown-plan centered aproach. I feel like smallish local models are getting capable of lots of things now, but they need lots of structure for resiliency.

show comments

PeterStuer

I've had no problem with Claude Code Opus 4.8 effort max using 20% token context (200k) on software development tasks (all stages). I aways load core source files and the ones we are working on up front. Around 20%, I make it autoprepare for a new session and clear.

Admittedly I have been doing this precautiously, based on anecdotal evidence, not because I had bad experiences with longer context deterioration myself.

In the brief time I had access to Fable 5, it went on long running tasks (>45 mins) into the 30-40% zone without apparent context coherence problems.

mcapodici

I /clear all the time out of habit. I want to be able to get the thing done with minimal context. It also means you can do it again slightly different if needed, you know the seed conditions for the task.

RandyRanderson

Why is it surprising that, at some point, more information will lead to worse performance?

It seems obvious. Moreover, in a simple model, it seems like whatever tokens you do add have to have MORE information than the average in the existing window.

In a non-trivial model (and this is the model I would choose), since you are adding them to the end, they likely have to have MUCH more information.

Proof as always is an exercise to the reader.

brunoluiz

100% with the author on that one, albeit the performance decay seems to depend on the type of task for me. Simple plumbing tasks seem to run okay with longer running contexts.

Also, some colleagues were playing around with RTK (https://github.com/rtk-ai/rtk), which decreases the amount of token used by tool calls and, although it seems an interesting idea, I am pretty sure there are many caveats. Although, I believe if these type of tools prove to be efficient enough, perhaps harnesses will have them natively.

torginus

Considering how expensive context is in terms of compute, I wonder why (and if ) vendors don't invest more into context engineering.

When it comes to source code, I feel like LLMs could just as well work with something like minified source code, if an LLM is trained on programming well, I think there's no reason why something like a variable should be represented by something more than a single token. Comments can be discarded, etc. In fact considering embeddings for LLMs are very rich, I think common ops could be reduced to a single token.

Imo that's why LLMs are soo good at reverse engineering. A lot of the time, assembly (with symbols) is pretty close to the source code, but compressed and encoded, and if you're familiar with the patterns of your compiler, reversing it is not that difficult.

Anyways, context engineering could be huge boon to input token curation imo (and maybe it already is)

afc

The approach we're taking to deal with this very real context rot is using a bunch of related techniques which we call transposing the agent loop: https://alejo.ch/3jt

In essence, we run many short agent loops, generating their prompts dynamically from structured data. Each loop advances the state in a small step towards the final goal.

kuboble

I think it's Your mileage may vary.

Few of the best sessions I have ever had with claude went into 700-800k territory.

I frequently reach 400-600k without visible (to me) signs of quality regression.

amunozo

Can anybody explain me why just not limit the context window to something smaller instead of all that context engineering? It forces things to be constrained.

steveridout

I wonder how much this depends on the quality and consistency of the context?

For example, it may be the case that a long context full of useful information relevant to the task is completely fine, perhaps even beneficial. And if the context contains a bunch of unrelated tangents and conflicting instructions, then it will be detrimental.

Have there been studies on what makes models get dumber? To what extent is context length to blame vs context quality?

rsanek

> The number on the box gets bigger every release.

Not really tho right? Since we got to 1m context in mid 2025 nearly no one has gone higher.

walthamstow

There's an env var you can set in Claude Code to bring the autocompact threshold down, effectively setting your own max context window. I have it at 400k.

k__

100K seems quite much.

I had the impression, models would get inconsistent after just 3000 words.

da-x

Perhaps compacting the context can be made in multiple requests over smaller and overlapping chunks to avoid using the 'dumb zone', and for yielding a better result.

mightyham

Even taking the author's criticism about large context windows for granted, which in my experience are exaggerated, they are still a huge UX improvement over short windows. That reason alone is enough for me to support them.

cowang

Evaluating the Sensitivity of LLMs to Prior Context

https://arxiv.org/abs/2506.00069

vlan121

Maybe this is the line, we'll hit eventually. Maybe the models become smarter, but the context will sit.

jackxlau

In my own testing I have seen peak performance happen usually within 15-20% of the intended context limit, albeit there are a few optimizations depending on the task quality.

dalemhurley

It is a lot like giving a person instructions, the more you tell them, the more they will forget the specifics.

Der_Einzige

Long context generation is a sampling problem. Set your opencode to use a modern sampler like min_p or newer and you'll see models behave better at longer context.

Febriss33

i let the main loop spawn sub terminal via tmux to prevent large contexts. it's great to divide tasks in small patterns and consolidate it step by step.

mystraline

Why is it a "dumb zone"?

What in the models causes this 'dumbing down'?

cubefox

The problem with "context rot" is that its existence and severity is purely anecdotal. As far as I know, nobody has actually measured context rot systematically. The only thing we know is that memory degrades somewhat in long contexts, via things like needle in haystack tests. But that's not the same issue. Context rot is usually taken to mean that the model gets dumber even if it doesn't need to remember specific things in its context window.

This would be really easy to measure. Just take some standard benchmarks, but fill up the context beforehand. Is the benchmark performance degraded? If so, by how much?

show comments

petesergeant

Is there any chance that this is because training corpus largely consists of documents shorter than the advertised context windows?

carterschonwald

context window size isnt quite the issue though, its that the attention mass kinda spreads out too much and everything kinda converges to a sortah global average region full of what we know to be slop! theres some really cool ways at the harness or model layer to mitigate this. just isnt really prioritized by the labs often.

andrewshadura

> dumb zone

Reminds me the sign, "Do not dumb here. No dumb zone."

woadwarrior01

aka Softmax context rot

mock-possum

Hasn’t been my experience at all - 1M window is a very clear upgrade working with Claude code.

BrenBarn

Even better, don't trust LLMs at all.