A developer opens a fresh session, runs cat src/**/*.ts > context.txt, pastes the result, and asks a question about a single function three directories down. The pasted blob is forty thousand tokens. The function in question is forty lines. The reasoning the developer wants from the model is local. The context they gave it is the entire codebase. They did this on purpose. They believe they were being thorough.
I have done this. So have most engineers I know who use these tools daily. It isn’t one developer’s idiosyncrasy. Somebody nods at the screen-share, somebody replicates the move on their own machine, and within a quarter the team has agreed in some Slack thread that this is what “giving the model context” looks like. The instinct is one folk model (the intuitive but mechanism-free account of how these tools work) expressed two ways: context is storage, and more is always better, just in case. Storage you load up so the model “has access.” A buffer the model dips into when the question requires it. The bigger the buffer, the more the model knows.
The mechanism is the opposite. There is no buffer. There is no place the model goes to look things up. The context window isn’t a holding area for material the model will reference later. It is the workspace itself, and every token in it shapes every answer the model gives.
The window is what the model is doing
Inside a single forward pass, the model attends across every token in the context, end to end. Every token already in the window can shift the weighting behind the next token the model produces. There is no internal cache of “things established earlier.” There is no compressed summary the model is keeping for itself. There are no facts the model “learned during priming” sitting in a separate place. The window doesn’t contain the workspace. The window is the workspace.
The folk model treats the model like a person whose office you are setting up. You want them to have the right reference books on the shelf so they can pull one down when needed. The mechanism is closer to a person whose entire desk is being re-read top to bottom every time they open their mouth. There is no shelf. The desk is the entire interaction.
Once that lands, the rest of this post is consequences.
Irrelevant context isn’t free
The folk version says: as long as you stay under the window limit, extra context is harmless. It might not help, but it can’t hurt. The model just ignores what isn’t relevant.
The model doesn’t ignore anything. Adding forty thousand tokens of unrelated code to a question about one function does three specific things, each with its own mechanism:
- Attention dilution. Attention is a soft retrieval over the entire context. When the relevant information for the current turn is two thousand tokens and the surrounding context is forty thousand tokens of priming, the attention pattern for any output token has to suppress forty thousand tokens of distractors to focus on the two thousand that matter. The signal-to-noise ratio is set by what’s in the window, not by what’s relevant in the window.
- Lost-in-the-middle. Models attend more strongly to the beginning and end of context than the middle. A forty-thousand-token dump puts the one function you asked about somewhere deep in that middle, and as the session grows, the front-loaded priming holds the strong opening position while earlier turns of the conversation slide into the weak one. The thing you actually care about ends up where the model is most likely to skim past it.
- No internal cache. The file tree from turn one is exactly as expensive to attend over on turn thirty as it was on turn one. There is no representation the model built early and kept. Every turn pays the full attention cost on the full transcript.
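The dilution in the first bullet can be made concrete with a toy calculation. The sketch below is not how any real model scores tokens. It assumes a single softmax in which every relevant token gets one made-up score and every distractor gets a lower made-up score (`relevant_score` and `distractor_score` are illustrative values, not measurements), then reports the share of attention weight that lands on the relevant tokens.

```python
import math


def relevant_attention_mass(n_relevant, n_distractors,
                            relevant_score=2.0, distractor_score=0.0):
    """Share of softmax weight that lands on the relevant tokens.

    Toy model: one attention head, one query, and two made-up scores.
    Both score values are assumptions chosen for illustration.
    """
    relevant = n_relevant * math.exp(relevant_score)
    distractors = n_distractors * math.exp(distractor_score)
    return relevant / (relevant + distractors)


# Two thousand relevant tokens alone, then the same tokens with a 40k dump.
focused = relevant_attention_mass(2_000, 0)
diluted = relevant_attention_mass(2_000, 40_000)

print(f"relevant tokens only:    {focused:.2f}")
print(f"plus 40,000 distractors: {diluted:.2f}")
```

Under these assumed scores, the share of weight on the two thousand relevant tokens falls from 1.00 to roughly 0.27 once the forty thousand distractors are present. Trained models separate relevant from irrelevant tokens far more sharply than this toy does, but the denominator still grows with everything in the window.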
The cost isn’t running out of room. The cost is reasoning poorly with room to spare.
This isn’t a hypothesis. Chroma’s 2025 “Context Rot” study, run across 18 frontier models, documents degradation as an architectural property of transformer-based attention: performance falls off well before stated capacity across every model tested, and the shape of the falloff is consistent even as the threshold varies by model and task. Drew Breunig’s taxonomy names the long-context version of this pattern context distraction: the model over-focuses on what’s in the window and neglects what it learned during training. Long before you hit the ceiling, the agent starts anchoring on irrelevant material, quoting stale data, and latching onto patterns from files it read an hour ago. The window isn’t the constraint. What the window contains is the constraint.
The same mechanism that makes paste-everything obviously bad also makes the slow accretion bad. They are the same failure at different time scales.
The visible version is the codebase dump. The invisible version is the global config that loads into every session. It started small and picked up additions across months. WSL path conventions. Commit format. PR format. Writing style. C# conventions. ADO CLI preferences. Database connection rules. Pull request comment API patterns. Agent launch verification rules. A handful of conventions that mattered when they were added, plus a handful that addressed workflows that have since changed, plus duplicates between sections nobody pruned. The file is now thirty-something kilobytes. Nothing was ever removed because removal requires confidence that nothing still depends on it, and confidence is harder than addition. Every session pays the full attention tax on every turn. Most readers will see their own files in this.
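A rough way to see how far a file like this has drifted is to measure it. The sketch below is only a starting point under simple assumptions: the helper name `audit_context` is hypothetical, and "duplicate" means nothing more than two non-empty lines that match after trimming whitespace and lowercasing. It cannot tell a stale convention from a live one.

```python
from collections import Counter


def audit_context(text):
    """Return (size in bytes, sorted list of lines that appear more than once).

    Lines are compared after stripping whitespace and lowercasing, which is
    an assumption about what counts as a duplicate, not a general rule.
    """
    size_bytes = len(text.encode("utf-8"))
    normalized = [line.strip().lower()
                  for line in text.splitlines() if line.strip()]
    counts = Counter(normalized)
    duplicates = sorted(line for line, n in counts.items() if n > 1)
    return size_bytes, duplicates


# Illustrative sample; a real run would read the file from disk instead.
sample = (
    "# Commit format\n"
    "Use imperative mood.\n"
    "\n"
    "# PR format\n"
    "Use imperative mood.\n"
    "Link the work item.\n"
)

size_bytes, duplicates = audit_context(sample)
print(f"{size_bytes} bytes")
for line in duplicates:
    print(f"repeated: {line}")
```

Pointing it at a real file is a matter of passing `Path("CLAUDE.md").read_text(encoding="utf-8")` from `pathlib`, with the filename swapped for whatever your own permanent context file is called. The byte count and the repeated lines are the mechanical part. Deciding what still earns its place is not something a script can do.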
A note before this slides somewhere it shouldn’t. None of this is an argument that permanent context is bad. A tight, conventions-only file of project terminology and gotchas is the legitimate form. The thirty-kilobyte version is bad because of bloat, not because it is permanent. The line between the two is the line between content that shapes how the model reasons (terminology, conventions, project-specific patterns) and content that supplies facts the model could fetch when needed. Conventions earn permanent context. Facts don’t. That distinction is a separate post. The point here is just that the bloated file fails for the same reason paste-everything fails.
Caching doesn’t buy you what you think it buys you
The most common reader objection at this point is prompt caching. A pasted prefix that costs forty cents the first turn costs four cents on every subsequent turn. The economics change. So why does it matter?
Caching changes economics. It doesn’t change behavior.
The model still attends over every token in its context every turn, cached or not. Caching eliminates the cost of recomputing the key/value (K/V) representations for the prefix on each request. It doesn’t remove those tokens from the attention computation that produces the next output token. Anthropic’s documentation states it plainly: prompt caching has no effect on output token generation. The cached forty thousand tokens of priming are still forty thousand tokens of attention pattern competing with the two thousand that actually matter for the current turn. Attention dilution stays. Lost-in-the-middle stays.
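The split between the bill and the window can be put in numbers. The sketch below uses only the figures from the text (forty cents for the uncached prefix, four cents for a cached read, a forty-thousand-token prefix, a two-thousand-token task) and makes simplifying assumptions: the task stays the same size every turn, conversation history and output tokens are ignored, and any extra charge for writing the cache is left out. The function name `session_totals` is hypothetical.

```python
PREFIX_TOKENS = 40_000
TASK_TOKENS = 2_000
FULL_PREFIX_CENTS = 40    # prefix processed from scratch
CACHED_PREFIX_CENTS = 4   # prefix served from the cache


def session_totals(turns, cached):
    """Return (prefix cost in cents, tokens in the window summed over turns)."""
    cost_cents = 0
    tokens_in_window = 0
    for turn in range(turns):
        # Only the first turn of a cached session pays the full prefix price.
        if cached and turn > 0:
            cost_cents += CACHED_PREFIX_CENTS
        else:
            cost_cents += FULL_PREFIX_CENTS
        # Cached or not, the whole prefix is in the window on every turn.
        tokens_in_window += PREFIX_TOKENS + TASK_TOKENS
    return cost_cents, tokens_in_window


for cached in (False, True):
    cents, tokens = session_totals(30, cached)
    label = "cached" if cached else "uncached"
    print(f"{label:>8}: ${cents / 100:.2f} for the prefix, "
          f"{tokens:,} tokens in the window")
```

Over thirty turns the prefix bill falls from $12.00 to $1.56, while the second number, 1,260,000 tokens in the window, is identical in both rows. The first column is what caching changes. The second column is what the model reasons over.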
This is the loophole the folk model loves to find. Cost and attention look like the same axis: more tokens mean more cost and more cognitive load. So if cost is solved, surely cognitive load is solved too.
The clean way to hold the distinction: cost is about your bill. Attention is about the model’s answer. They were never the same thing. They were correlated in a way that made the conflation easy. Caching breaks the correlation and exposes that the cognitive cost was the load-bearing one.
The rule of thumb falls out: if a chunk of context wouldn’t earn its place at full price, the cache discount doesn’t earn it a place either. The discount lowers the price of the noise. It doesn’t lower the cost of reasoning over the noise.
The developer analogy is the right shape
A developer working a real bug doesn’t load the entire codebase into their head. They couldn’t if they tried. They have a vague spatial sense of where things live: the auth code is roughly over there, the migrations live in that folder. They open the file the bug is probably in. They open the file that calls it. They might open one more if the call graph leads somewhere unexpected. The rest stays on disk.
The structural reason is bandwidth. Working memory is small and fast. The file system is large and slow but accurate. The split isn’t a flaw to engineer around. It’s the shape that lets the developer reason at all. A developer who tried to keep the entire codebase in their head wouldn’t be a faster developer. They would be a worse one, because every thought would be diluted by every other piece of code competing for the same attention.
The model is in the same position. Limited high-bandwidth attention over the current task, effectively unlimited slow-but-accurate access to anything you can put on disk and let it fetch. The shapes match because the constraints match.
The folk-model instinct here is that the model has different ergonomics from a developer because it is a machine, and machines are good at reading large amounts of input. The mechanism says no. The model can do something the developer cannot, which is read forty thousand tokens in a quarter of a second. That speed advantage is real. It is a separate thing from the question of whether forty thousand tokens should be in the workspace. Speed of intake doesn’t make irrelevant intake free.
This isn’t theoretical. Practitioners who have built context-loading tools have repeatedly converged on dynamic loading after trying the file-cached version and watching it go stale. One who maintained a “prime” command that wrote project context to a file abandoned it after the file kept falling out of sync with the code. Rebuilding the context fresh each session, against the live state, was strictly better than caching against a snapshot that was lying by the third turn. Different teams, different tools, same discovery.
The same convergence shows up in the way people prune long sessions. Practitioners who have written down their compaction rules treat exploratory file reads that didn’t lead to findings as the first category to drop. Raw file contents go: the read tool can run again. Build output goes, except for the summary line. The pattern that gets formalized, over and over, is the same one: keep the small working set, drop the rest, fetch when needed.
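The pruning pattern described above can be sketched as a small filter over a session transcript. Everything structural here is an assumption made for illustration: the `Entry` type, its `kind` values, the `led_to_finding` flag, and the choice to treat the last non-empty line of build output as its summary line. Real tools record sessions differently.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Entry:
    kind: str                     # "message", "file_read", or "build_output"
    text: str
    source: str = ""              # file path, for file reads
    led_to_finding: bool = False  # hypothetical flag set during the session


def compact(entries):
    """Keep the small working set and drop what can be fetched again."""
    kept = []
    for entry in entries:
        if entry.kind == "file_read":
            if not entry.led_to_finding:
                continue  # exploratory read with no finding: drop it
            # Raw contents go; the read tool can run again.
            pointer = f"[contents of {entry.source} dropped; re-read on demand]"
            kept.append(replace(entry, text=pointer))
        elif entry.kind == "build_output":
            # Assumes the summary is the last non-empty line of the output.
            lines = [line for line in entry.text.splitlines() if line.strip()]
            kept.append(replace(entry, text=lines[-1] if lines else ""))
        else:
            kept.append(entry)
    return kept


session = [
    Entry("message", "fix the login bug"),
    Entry("file_read", "export const a = 1;", source="src/a.ts"),
    Entry("file_read", "function login() {}", source="src/auth.ts",
          led_to_finding=True),
    Entry("build_output", "compiling...\nwarning: unused import\n"
                          "Build succeeded: 0 errors\n"),
]

for entry in compact(session):
    print(f"{entry.kind}: {entry.text}")
```

The messages survive untouched, the read that found nothing disappears, the read that mattered shrinks to a pointer, and the build log shrinks to one line. What stays is the working set. Everything else remains on disk, where it can be fetched when needed.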
What I didn’t say
I didn’t say the model has no memory between turns. The temporal version of this argument, why a long conversation doesn’t mean the model “remembers” what was said two hours ago in the same way a person would, is its own thread and a different mechanism. I’m keeping this post on the spatial side: the geometry of attention within a single context, regardless of whether that context was built across one turn or a hundred.
I didn’t prescribe a workflow change beyond the closing discipline below. The pull is to end with “so here is how to load context just in time, on demand, only what the model needs.” That’s a real practice with its own argument and its own examples. I am not making that argument here. The mental model has to land first. The workflow follows from it the way water follows from gravity, and a recommended workflow without the underlying mental model is a list of rules to memorize, which is exactly what the folk model produces and exactly what this series is trying to retire.
I didn’t get into transformer internals. None of it is needed for the claim. The claim is about how to think about the window, not how the window is implemented. The reader doesn’t need the architecture. They need the corrected intuition.
The discipline question
Look at whatever permanent context you have right now. Your CLAUDE.md, your system prompt, your recurring boilerplate, the standing instructions you paste at session start. Read it as if you had never seen it before.
Then ask: what would I remove if removal cost nothing? Not “what could I remove if I had time to be careful.” The honest version. If a button existed that deleted any line you pointed at, with no risk of regret and no possibility that some forgotten workflow would break, what would you cut?
The threshold for what counts as bloat isn’t one I have a clean answer for. Somewhere between five hundred bytes of pure conventions and twenty kilobytes of accreted notes, the file crosses from carrying its weight to costing more than it pays. I have moved my own threshold several times and don’t have a stable rule for it. What I have is the question. Run it on your own files tonight. The answers won’t be subtle.
Further Reading
- Drew Breunig, “How Long Contexts Fail”: the taxonomy of context failure modes (poisoning, distraction, confusion, clash) that names the long-context degradation patterns described above
- Nelson F. Liu et al., “Lost in the Middle: How Language Models Use Long Contexts”: the canonical paper documenting position bias in long-context attention, published in TACL 2024
- Chroma, “Context Rot: How Increasing Input Tokens Impacts LLM Performance”: 2025 study across 18 frontier models documenting degradation as an architectural property of transformer attention rather than a capability gap that scaling solves
- Anthropic, prompt caching documentation: the primary source for “prompt caching has no effect on output token generation,” the load-bearing fact behind the caching-vs-attention distinction