A senior engineer is forty turns into a session with a coding agent. The first twenty turns were exploratory: file reads to map a service, grep passes to find callers, a couple of dead-end branches that produced nothing usable, two web fetches against a vendor doc, an aborted attempt to run a build that needed credentials the agent did not have. The next twenty turns were the actual work: drafting a small refactor, walking the test cases, agreeing on a name. By turn forty the agent has started anchoring on a method from a file it read at turn six, citing line numbers from a version of the file that has since been edited. It quotes a function signature from a transcript section that is no longer accurate. It picks up a pattern from one of the dead-end branches and proposes it as if it were the consensus direction. The work product comes out subtly off. None of the suggestions are actively wrong; several of them are wrong-shaped in a way that takes a careful reader to catch. This is the failure shape the post calls context pollution.

The conventional read of this session is that the agent ran out of room. The context window is too small for the kind of work being done. The fix, on this read, lives one model generation away: bigger windows, smarter attention, a model that can handle two hundred thousand tokens without the focus drift. Until then, manage with compaction, periodic resets, and patience.

That read is wrong about the problem. The room was not full. The room was full of debris.

The framing this post argues for is the pollution family: a class of failure where a working surface accumulates noise faster than it accumulates signal, and the surface’s quality degrades not from running out of capacity but from filling with debris. Context pollution is the family member that lives inside a single agent’s conversation. The structural fix is not a bigger room. It is a different floor plan.

The bigger-window school, in its strongest form

The popular school says: context windows used to be small enough that context discipline was a real engineering constraint, and most of what teams built around agents was scaffolding to work around the constraint. Now that frontier models hold a million tokens and the next generation will hold ten million, those workarounds are obsolete. The right move is to stop adding plumbing and let the model see everything. The friction of splitting work into sub-tasks, of summarizing intermediate state, of choosing what to drop from context, all of it falls away when the room is large enough to contain the whole job.

There is a real case here, and a serious one. Capacity constraints did dictate a lot of early agent-engineering practice. Plenty of decisions that look like methodology were really workarounds for a limit that no longer binds. When capacity expands by an order of magnitude, sweeping away a layer of those workarounds is correct. The school is also right that hand-curating exactly which files an agent sees on a small task is over-engineering when the agent could simply read the directory. Many small tasks have gotten meaningfully easier as windows have grown, and pretending otherwise is its own kind of stale.

However.

The school’s premise is that contents take care of themselves once size stops being a constraint. The premise the rest of this post argues is the inverse. When size stops being a constraint, contents become the only constraint left.

In ordinary multi-turn coding sessions, focus degradation shows up well before any capacity ceiling has been touched. The agent starts anchoring on irrelevant material, quoting stale data, latching onto patterns from files it read an hour ago, while the token count is sitting somewhere in the middle of the available window. The window is not the constraint. What the window contains is the constraint.

Think about how you would brief a senior engineer on a decision. You would not hand them the seventeen files you read, the four web searches you ran, the two tests you re-ran, the design you sketched and abandoned, and the partial commit message you started typing. You would tell them the question, the relevant constraints, the option you are leaning toward, and your reservation about it. The brief is a hundredth the size of the path that produced it. The senior engineer makes a better decision from the brief than from the path because the brief is mostly signal and the path is mostly process. Inside one agent’s conversation, the path is what accumulates. The brief is what should be present at the moment of decision but rarely is.

This is the distinction the rest of the post turns on. Context pollution is the failure mode where a working surface accumulates noise faster than it accumulates signal, and the surface’s quality degrades not from running out of capacity but from filling with debris. When pollution is the actual failure, capacity expansion does not fix it; it postpones it. A bigger window with the same noise rate fills with more debris before the same degradation arrives. The prescriptions split: if the problem is capacity, the engineering response is to wait for bigger models, or pay for them, or compress and chunk to fit. If the problem is pollution, the engineering response is structural: change what goes into each working surface, and what crosses between surfaces.

What context pollution actually does at the decision moment

The clean failure mode is the call site picking the wrong overload because three earlier reads suggested the common path. The messier failure mode is the one where the agent over-attends to a string from an earlier file and parrots a pattern from it that was specifically not what the current code path needs. Drew Breunig has a precise vocabulary for this family of failure: context distraction when the model over-focuses on accumulated context at the expense of its training, context confusion when superfluous content drives a low-quality response, and context poisoning when an earlier hallucination becomes a load-bearing reference downstream. Each is pollution viewed under a different lens. None of them is a capacity problem. This post’s argument sits underneath Breunig’s vocabulary: it names the structural reason the surface fills with the contents that produce those failures, and the discipline that keeps the surface clean in the first place.

Three more symptoms are worth naming, because they are recognizable to anyone who has watched long sessions go off the rails, and because they help separate the pollution diagnosis from the capacity one.

The first is decisional drag. The agent has the right answer somewhere in scope. The agent does not pick it. The agent picks something nearby that is more salient because it appeared more recently, or was repeated across files, or was the dominant pattern in the most recent search results. The right answer is in the room and loses the vote. From the outside, this gets called “the agent lost the thread” or “the model is just not as good at long tasks.” The thread was never lost. The thread was outvoted by everything else attending the meeting.

The second is identity drift. Twenty turns into a session, the agent’s behavior has shifted in ways nobody asked for. It has become more cautious, or more verbose, or more eager to suggest tangential improvements. This drift usually traces to the agent’s own outputs being part of the context the next turn attends to. Each output it produces becomes one more input it has to weight against. After enough turns, a meaningful fraction of the working context is the agent’s own past behavior, and the agent is now imitating itself. The room is full of it.

The third shows up in retries. An agent that fails a task once and is asked to try again rarely tries again from a clean slate; it tries again from the conversation that includes its first failed attempt. The first attempt is usually present in scope. The second attempt is influenced by it. Often this means the same wrong move is reproduced with cosmetic variation, because the most salient prior pattern is the wrong one. Engineers learn quickly that the correct retry strategy is often “drop the conversation and start fresh.” That instinct is the right one. It is also a workaround for pollution, not a solution to it.

What fills the room

The contents that produce those symptoms are predictable. Anyone who has watched a long session play out can name the categories.

Tool calls are the largest source. Every file read, every grep, every web fetch, every shell invocation surfaces back into the conversation as both the call and the result. A single grep pass that touches twenty matches across fifteen files leaves behind twenty result blocks, fifteen file paths, and the agent’s own paragraph of summary about what the grep showed. The cost is not just the window space the pile occupies; it is the focus tax paid every time the model has to scan past the pile to get to the question being asked.

Exploratory dead-ends are the most insidious source. The session pursues a hypothesis, runs four or five tool calls testing it, concludes the hypothesis was wrong, and moves on. The tool calls and partial reasoning from the abandoned branch stay in the context. They look, at later turns, indistinguishable from material that did lead to the current direction. Nothing in the conversation history flags a dead-end versus a live thread, and the pattern that looked plausible enough at the time to investigate still looks plausible later.

Intermediate state is the third source and the easiest to underestimate. Mid-task scratch reasoning, half-finished thoughts, the agent’s internal “let me think about this” prose, error messages from a build that failed and was retried, the second and third versions of an iteratively-refined attempt: all of it accumulates. A 500-word answer where 50 would have done is 450 words of noise competing for attention on every subsequent turn.

Then there is the response side, which is symmetric and routinely overlooked: the model’s own outputs become context for every following turn. A verbose model encouraged to elaborate produces its own pollution at the same rate as tool-call accumulation, and there is no equivalent of a tool-result line item to make the verbosity visible.

None of those categories requires a near-full window. The degradation arrives long before the window is exhausted. The window can be at thirty percent capacity and the room can already be full of debris, because the debris-to-signal ratio is what matters, not the absolute volume.

The capacity workaround that misses

The standard workaround when sessions get too long is compaction: summarize the conversation so far, replace the raw transcript with the summary, continue. Some harnesses do it automatically when the window fills past a threshold. Some require an explicit command. The intent is reasonable: shrink the volume so the session can keep going.

Compaction is symptom management. It addresses the volume side of the problem and partially addresses the noise side, depending on what the summarizer chose to keep. The summarizer does not know which parts of the conversation matter most to the current task. Important decisions, specific configuration values, the exact wording of a constraint that came up earlier: any of it can get reduced to a one-line summary that loses the detail the next turn needs. Sometimes the summary is good; sometimes the summary becomes the new problem, because the model is now making decisions against a lossy compression of context that no longer contains the precision the work requires.

Compaction is necessary when the structural fix is unavailable, and teams that work in long sessions develop deliberate disciplines around it. The discipline that has worked is to specify what the summarizer must preserve and what it should drop: keep the current task state, the specific values, the in-flight decisions; drop raw file contents (re-read with the file-reading tool if needed), full build output (keep only the summary), exploratory reads that did not lead to findings. That last category is worth pausing on. The fact that practitioners have explicitly named “exploratory reads that did not lead to findings” as a category of contents to drop is the diagnostic. It is a category of debris. It is the recognition, formalized, that the session accumulated material that contributes nothing to the work going forward.
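That discipline holds up better when it is written down rather than left to the summarizer's defaults. A minimal sketch of the written-down version, assuming a generic `summarize` callable standing in for whatever model call the harness uses; the instruction text and the helper are illustrative, not any particular tool's API.

```python
from typing import Callable

# A hypothetical compaction policy: say explicitly what the summarizer must
# preserve and what it must drop, instead of trusting its defaults.
COMPACTION_INSTRUCTIONS = """\
Summarize the session so far for an agent that will continue the task.

Preserve exactly:
- the current task state and the next intended step
- specific values: config keys, versions, paths, the exact wording of constraints
- decisions still in flight, with the options under consideration

Drop entirely:
- raw file contents (they can be re-read with the file-reading tool)
- full build and test output (keep a one-line pass/fail summary)
- exploratory reads and dead-end branches that produced no findings
"""


def compact(transcript: list[dict], summarize: Callable[[str], str]) -> list[dict]:
    """Replace the raw transcript with a policy-guided summary.

    `summarize` is whatever call the harness uses to run the model; it is
    passed in so the sketch stays model-agnostic.
    """
    raw = "\n".join(f"{m['role']}: {m['content']}" for m in transcript)
    summary = summarize(COMPACTION_INSTRUCTIONS + "\n\n" + raw)
    # The session continues from the summary, not from the raw history.
    return [{"role": "system", "content": "Compacted session state:\n" + summary}]
```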

The structural fix is to keep that category from entering the context in the first place.

The structural fix: results, not transcripts

The room stays clean if the work that would fill it never enters it. The mechanism is discrete delegation: a multi-agent design where the orchestrator hands a task to a sub-agent, the sub-agent works in its own context, and the sub-agent returns only the result; the orchestrator does not observe or accumulate the sub-agent’s intermediate state.

The principle, stated narrowly: when an answer is wanted but the process that produced it is not, send the work to a sub-agent. What returns is the answer. What stays in the sub-agent’s context, and never enters the orchestrator’s, is the file reads, the grep output, the dead-end branches, the half-finished reasoning, the intermediate state. The orchestrator’s session stays focused on the decisions it actually has to make. The next question gets the model’s full attention on what was actually asked, not on the residual noise of the last hour.
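A minimal sketch of that boundary, assuming a generic `call_model` completion call and a `run_tool` executor standing in for whatever the harness actually provides; both names, and the message roles, are placeholders. The structural point is where the lists live: the sub-agent's transcript is a local variable that goes out of scope when the function returns, and the only thing that ever touches the orchestrator's messages is the final result string.

```python
from typing import Callable

Message = dict[str, str]


def run_subagent(
    task_spec: str,
    call_model: Callable[[list[Message]], Message],
    run_tool: Callable[[Message], Message],
    max_turns: int = 20,
) -> str:
    """Do the work in a private context; return only the result."""
    # The sub-agent's working surface: local to this call, discarded on return.
    transcript: list[Message] = [{"role": "user", "content": task_spec}]
    for _ in range(max_turns):
        reply = call_model(transcript)
        transcript.append(reply)
        if reply["role"] == "tool_call":
            # File reads, grep output, dead ends accumulate here, not upstream.
            transcript.append(run_tool(reply))
        else:
            return reply["content"]  # the only thing that crosses the boundary
    return transcript[-1]["content"]


def delegate(orchestrator_messages: list[Message], task_spec: str, **harness) -> None:
    """The orchestrator records the result, never the sub-agent's transcript."""
    result = run_subagent(task_spec, **harness)
    orchestrator_messages.append({"role": "tool_result", "content": result})
```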

Two things about this are worth being precise on. First, it is not the same as breaking work into smaller tasks. Smaller tasks inside one conversation still accumulate inside that conversation. The structural property is the separation of working surfaces, not the size of any individual unit of work. Second, it is not about prompting style. Discrete delegation is not what you get by writing a longer system prompt. It is what you get by spawning a sub-process whose context the parent does not inhabit and whose intermediate state the parent never reads.

The shape of the practice can be made concrete. A research task that wants to know how a particular module is wired into the rest of a codebase is the kind of question that, asked in the main session, leaves behind file reads, grep output, inheritance-chain traces, and a few exploratory branches that did not pan out. Asked of a sub-agent with a tight scope and an explicit response-format constraint, the same question produces a tight summary the orchestrator can consume. The sub-agent did the same fifteen file reads. The orchestrator did not see them. Launch three sub-agents in parallel, each producing a short report on a separate aspect of the problem, and the orchestrator's session contains three short reports plus whatever it had before. Not the file reads. Not the searches. Not the traces. The sub-agents have done substantial work; the orchestrator has paid for it in inference cost, not in pollution cost.
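Continuing the sketch above, the parallel version is a fan-out over the same boundary. The questions you would pass in are whatever the task needs; the property that matters is that each one gets a disposable context of its own.

```python
from concurrent.futures import ThreadPoolExecutor


def parallel_research(questions: list[str], **harness) -> list[str]:
    """One sub-agent per question, one short report per sub-agent.

    Every file read and search a sub-agent performs stays in a context that
    no longer exists by the time the orchestrator reads the reports.
    """
    with ThreadPoolExecutor(max_workers=len(questions)) as pool:
        return list(pool.map(lambda q: run_subagent(q, **harness), questions))
```

Called with three scoped questions, this hands the orchestrator three short reports and nothing else; the cost of the reads was paid in inference, not in pollution.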

The practical move at the orchestrator boundary is verbosity calibration: specify in the task what shape the answer should take and what cap on length applies, because if you do not, the sub-agent picks the shape, and the shape it picks is usually longer than the orchestrator needs. “Under two hundred words. Three findings. No preamble.” That is not formatting fussiness. It is the discipline that keeps the result from being a small transcript in disguise. Without the cap, the sub-agent returns its discovery in narrative form, the discovery enters the orchestrator’s context as a six-hundred-word essay, and most of the pollution-savings the delegation was supposed to produce evaporate at the boundary. Spawning a sub-agent is cheap; designing what crosses the boundary is the part that determines whether the spawn was worth it.
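One way to make the cap part of the task spec rather than a hope. Everything here is illustrative; the point is that the shape and the length of the return are decided on the orchestrator side, before the sub-agent exists.

```python
# A hypothetical task spec with the response format decided up front.
task_spec = """\
Question: how is the billing module wired into the request path?

Respond with:
- at most three findings, one sentence each
- the file and symbol each finding points at
- no preamble, no narration of the search

Hard cap: 200 words.
"""
```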

The objection that arrives next is reasonable: doesn’t this just push the same problem into the sub-agent? The sub-agent’s context is also a window. It also accumulates. If the work is large enough, won’t it pollute the sub-agent’s session the same way it would have polluted the orchestrator’s?

The asymmetry that resolves the objection is that the sub-agent’s context is bounded by the task’s scope, not by the conversation’s history. The sub-agent starts with whatever task spec the orchestrator gave it, accumulates only the material that task requires, returns the result, and is gone. Its session ends at the boundary. The pollution stays in a context that no future turn has to navigate, because no future turn exists. The orchestrator’s session is the long-lived one, and it is the one that pays the cumulative pollution cost across many tasks if delegation is not used. Pushing the work into a short-lived sub-agent context is the entire point. The sub-agent context fills with debris and is then discarded; the orchestrator context never accumulates the debris in the first place.

The corollary is that the boundary discipline matters in both directions. The task spec going to the sub-agent should contain only what the sub-agent needs. Information that does not directly relate to the work at hand competes for the sub-agent’s attention with the information that does. A sub-agent given a tight task spec and a calibrated response constraint does the work and returns a clean result. A sub-agent given the entire orchestrator’s session as context, “in case it helps,” inherits the pollution the delegation was supposed to avoid. The discipline at the boundary cuts both ways.

A note on forked subagents

A recent shape of this pattern is worth calling out because it is going to be reached for often and it half-honors and half-violates the discipline this post argues for. Claude Code's forked subagents feature (currently experimental, behind an environment flag) spawns a sub-agent that inherits the entire main conversation as its starting state: same system prompt, same tools, same model, same message history. The sub-agent runs, does its work, and only the final result returns to the main session. The sub-agent's tool calls and intermediate reasoning never enter the main context.

From the orchestrator’s perspective this looks like discrete delegation (the sub-agent works in its own context and returns only the result). The bookkeeping of getting the sub-task done stays out of the main conversation. Only the result returns. On the orchestrator-side ledger, the structural property holds.

The other half of the structural property does not hold. The principle is not just "keep the main context clean." It is "give each piece of work its own clean context." A forked sub-agent's first move is not to compose a tight brief from what the sub-task actually needs. Its first move is to inherit everything the main session had already accumulated, whether the sub-task needs any of it or not. The sub-agent's working surface starts polluted by inheritance. The framing the feature itself uses for when to reach for a fork is the case where a named subagent "would need too much background to be useful." That is the case where composing a tight brief is hard, not the case where pouring the whole conversation into the sub-agent is the right move.

Forking lets the orchestrator skip the composition step. The cost is paid by the sub-agent, which now does its work from a surface that contains all the residue the orchestrator had been carrying. The pollution that was about to degrade the main session’s decisions degrades the sub-agent’s decisions instead, with the additional disadvantage that the sub-agent has even less context than the orchestrator did for telling residue from signal. The result returns clean. Whether it was the right result is the question the structural argument was supposed to be defending.

The shape of the right move when this temptation arrives is the topic of the next post: composing the brief is the orchestrator-side engineering act, and skipping it pushes the failure mode somewhere harder to see, not somewhere it goes away.

What kinds of work belong where

A working rule, useful enough to apply on Monday. Work that accumulates path more than it produces decisions belongs in a delegated context. Work that is decisions about what to do next belongs in the orchestrating context. The split is between the deciding of what to do and the doing of it.

Reading a directory tree to find the right file is path. The choice of which file to act on is decision. Running a test suite and recovering from three flaky failures is path. The judgment of whether the suite is now passing meaningfully is decision. Searching the web for an API change is path. The choice of which version of the API to write against is decision. Drafting and refining a function until it compiles is path. The choice of where the function lives and what its signature is is decision.

Most agent failures show up at decision points. Most pollution accumulates during path. Putting path inside delegated contexts and keeping the orchestrating context focused on decisions is the structural lever. It is not a guarantee that decisions come out right; it is the precondition that gives them a chance.

There is a class of work that resists this split. Exploratory work, where the path is the decision, where you cannot say in advance what the answer should look like because the point of the work is to find out. That class is real and is exactly where single-context conversational agents shine, because the accumulation of intermediate state is part of what produces the answer. The mistake is treating that mode as the universal mode. Most production work is not exploratory in that sense. Most production work has a known shape and a knowable answer, and the path is overhead. The accumulating-state mode is a tool for one phase of work, not the default operating mode for all of it.

The acute amplifier: real-time multi-agent chatter

Context pollution, recall, is the failure where a working surface fills with debris (tool calls, dead-end reasoning, intermediate state) faster than it fills with signal, degrading decisions long before the window is full. There is one design pattern that takes that failure and makes it strictly worse. It is the pattern where multiple agents communicate with each other in real time during task execution, each accumulating the others’ transcripts in its own context.

The popular school treats this as an emergent capability, frames the agent-to-agent conversation as the locus of intelligence, and presents pipelines of chatting agents as the architecture pattern that scales beyond what a single agent can do. The strongest version of the case is that complex tasks decompose naturally into specialized roles, and that the agents talking to each other surfaces information no single agent could produce alone. There is real signal in that argument; some tasks do benefit from specialization.

From a pollution lens, real-time multi-agent chatter is the worst-case version of the failure mode, by construction. Every agent in the conversation accumulates not just its own work but the back-and-forth of every other agent. A coordinator that mediates between specialists carries the full transcript of every conversation it mediated, and as the work progresses the coordinator's context fills with the cumulative chatter of agents whose intermediate work the coordinator never needed to see. The pollution rate is multiplied by the number of conversational participants. A two-agent conversation pollutes both agents at roughly twice the single-agent rate. A three-agent pipeline with a coordinator pollutes the coordinator at roughly three times that rate, while each specialist goes on polluting its own context besides.

A scope change mid-task is the scenario that exposes the cost most clearly. When direction changes during multi-agent work, the accumulated chatter from the abandoned approach contaminates the coordinator’s context badly enough that redirection is unreliable. The reliable move ends up being a full restart: dismiss the team, discard the in-flight work, spawn fresh agents with corrected context. No amount of “ignore what we just discussed” prompting reliably overrides the contamination already in the context.

The structural alternative exists. Discrete delegation with calibrated handoffs is how multi-agent work scales without chatter pollution: the failure where each agent in a real-time conversation accumulates every other agent’s intermediate work, multiplying the pollution rate by the number of participants. That development belongs to a later post. The note for this one is that the pollution lens predicts which design choice degrades and which scales.

Same shape, different surface: rule-set pollution

The shape this post has been describing (a working surface that fills with debris faster than it fills with signal, degrading the decisions made from it) is not unique to a single agent’s context window. The same shape appears, scaled up, in the rule systems that govern how a team’s agents behave.

A team that captures every observation from every agent session into a rules file, with no filter between raw capture and rule promotion, ends up with a rules file that has the same pathology as the polluted context. It accumulates noise faster than signal. Every single-occurrence preference, every one-off reaction to a bad output, every developer’s idiosyncratic correction enters the rules and competes for attention with the rules that genuinely encode team policy. The agents that read the rules file then have to navigate the noise on every invocation, and the rules file’s signal-to-noise ratio drops the same way a long session’s does. The teams that get sustained leverage from accumulated rules are the ones that capture aggressively but promote conservatively, with an explicit threshold separating raw observation from committed rule.
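A minimal sketch of the capture-aggressively, promote-conservatively split, assuming a simple observation counter and a hand-tuned threshold; the threshold value is illustrative, not a recommendation.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RuleBook:
    """Capture everything; promote only what clears an explicit threshold."""
    promote_after: int = 3  # an observation must recur before it becomes policy
    observations: Counter = field(default_factory=Counter)
    rules: set[str] = field(default_factory=set)

    def capture(self, observation: str) -> None:
        # Raw capture is cheap and unfiltered: the "capture aggressively" half.
        self.observations[observation] += 1

    def promote(self) -> list[str]:
        # The gate: only recurring observations cross into the committed rules.
        promoted = [
            obs for obs, count in self.observations.items()
            if count >= self.promote_after and obs not in self.rules
        ]
        self.rules.update(promoted)
        return promoted
```

The agents read the committed rules; the raw observations never enter any agent's context.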

The mechanic is the same. The surface is different. A long session's context is a working surface that fills with debris if there is no discipline about what enters; a rules file is a working surface that fills with debris if there is no discipline about what gets promoted. The pollution family is the broader frame. Context pollution is one member. Chatter pollution is another. Rule-set pollution is the third: raw observations and one-off corrections accumulating in a rules file that has no promotion gate. The diagnosis is the same in each case, and the structural fix is also the same: bound what enters the surface, and require explicit discipline at the boundary that decides what crosses.

When a team complains that “the agent lost the thread,” “the model got confused,” “we need a bigger context window,” or “the agent team produced an incoherent result,” the underlying failure is usually pollution misdiagnosed as capacity. The fix the team reaches for is more capacity. The fix the failure wants is structural. Capacity expansions do not eliminate pollution; they postpone the symptom.

What this is not: when capacity really is the constraint

This is not a claim that bigger context windows do not help. They help, sometimes a lot, especially for the class of work where the input genuinely is large and the agent needs to attend across all of it. A million-token window changes what is feasible in document analysis, codebase question-answering, and any task where the answer requires synthesizing across material that previously had to be chunked. Capacity bites in those cases, and capacity expansion is the right response.

It is also not a claim that single-agent conversational work is obsolete. For the exploratory mode named earlier, for short tasks, for the kinds of work where setup overhead would dominate, single-agent conversation is the right tool. Replacing every workflow with a multi-agent orchestration is the same kind of overcorrection that the original “scale up the model” reflex was.

The argument is more specific. When you observe an agent making worse decisions late in a long task than it made early, the first hypothesis to test is not "the model is not capable enough yet." The first hypothesis to test is "the working surface this decision is being made from is polluted, and the structural change that addresses it is to give the next instance of this kind of work its own surface." That hypothesis is cheap to test: run the same task with the same model in a fresh context, with only the material a clean brief would contain. If the fresh-context version makes the call right, capacity was not the issue. The decision was outvoted, not unreachable. The discipline question is then how much of the work in your stack is currently being done from a surface that the polluted-versus-clean experiment would expose.
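The experiment is small enough to script. A sketch, reusing the generic `call_model` stand-in from the delegation sketch earlier; judging the two outputs is still on you.

```python
def polluted_vs_clean(polluted_context, clean_brief, question, call_model):
    """Ask the same question from the full session and from a fresh brief.

    If the clean run gets the call right and the polluted run does not,
    the failure was pollution, not capacity: the answer was outvoted,
    not unreachable.
    """
    polluted = call_model(polluted_context + [{"role": "user", "content": question}])
    clean = call_model([
        {"role": "user", "content": clean_brief},
        {"role": "user", "content": question},
    ])
    return polluted["content"], clean["content"]
```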

What does the boundary in your work look like?

Look at the most recent agent session that produced a result you had to discard or substantially rework. Not a session where the task was the wrong shape for an agent in the first place. A session where the task was a fair one to hand to an agent and the output came out subtly off. Trace the session backward to the turn where the work went wrong. Look at what was in the context by that turn. How much of it was directly relevant to the question being answered, and how much was the residue of earlier exploration the model had to scan past? If the residue was substantial, that is pollution, and the question worth asking next is whether any of that residue could have lived in a sub-agent's context instead of the main session's.

Find one task in your current workflow that the main session is doing inline that could be delegated. The shape to look for is research: the kind of question where the answer matters and the process that produced it does not. Codebase research, documentation lookup, exploratory analysis to confirm a hypothesis. Move that one task to a sub-agent. Calibrate the response format at the boundary. Run the next session.

The check on the structural discipline is whether the orchestrator’s context, at the end of an hour of substantive work, looks like a transcript or like a workspace. A transcript reads as the history of every step the work took; a workspace reads as the current state of the decisions the work is engaging with. The transcript is what pollution produces. The workspace is what the discipline preserves. Whether the team has the right number of sub-agents, the right verbosity calibration, the right delegation patterns: those are tuning questions. Whether the team has the discipline at all is the first one.

I do not have a stable answer for where to set the verbosity caps, or for which categories of work always belong in sub-agents and which sometimes belong inline. I have moved the threshold several times depending on the work. The pattern that has held through all of that calibration is that the orchestrator's context wants to be a workspace, and the pollution that would make it a transcript belongs in a context the next turn never has to navigate. The discipline is whatever keeps that true.