Composing the Context: Agent Failures as Composition Bugs

A planning agent composes a spec that uses a third-party library’s configuration object. The spec names a config key the agent confidently expects the library to read. The key is plausible. It matches the library’s documented surface from a couple of years back, which is what training data is dense in. The implementation agent builds against that key, wires it through the loader, writes serialization helpers around it. The unit tests pass against the agent’s own assumed contract. The code review goes through, because the reviewer is reading the implementation against the spec, not the spec against the library’s current source. The failure surfaces at runtime, several stages and several hours downstream of the original assertion. The library renamed that key one major version back. The lockfile would have shown the version. The package manifest would have shown the version. The agent never opened either. The work has to be redone from the spec’s composition down.

The conventional read of this story is that the planning agent hallucinated. The model invented a fact. Better models will hallucinate less. Until then, manage with reviews and rework.

That read is wrong about the problem. The model didn’t invent a fact. The model answered a question from the only source it had: its training data. The context it was working from didn’t route the question “what version of this library is in scope” to the lockfile, where the answer was unambiguous. So the question got answered from training, the way most unrouted questions do. The failure was upstream of the model. It was a composition failure.

The framing this post argues for is that the context an agent works from isn’t a given. It’s an artifact someone (or some upstream process) deliberately composed. Most failures attributed to bad model output are composition failures: the context didn’t have the right material, didn’t have it in the right order, or didn’t make clear which source wins when sources disagree. The post develops three patterns that, together, constitute composition discipline: source routing, self-containment, and authority. They’re three views of the same engineering activity: deliberately assembling the inputs to a delegation, before the agent runs.

The give-the-model-more-context school, in its strongest form

The popular school says the answer to bad output is more context. The current generation of frontier models can hold a million tokens in scope. The right move is to put the relevant codebase in scope, attach the documentation, include the recent design discussions, hand the model the spec and let it reason with the full picture. Drew Breunig and others have done excellent work giving this practice its name and its vocabulary, calling out context engineering as the systematic engineering of contexts in pursuit of an outcome and naming specific failure modes (context confusion, context poisoning, context distraction) that explain why a poorly composed context degrades the answers it produces. Birgitta Böckeler frames the same practice as “curating what the model sees so that you get a better result”, with the honest caveat that it is “not really engineering” because the medium remains probabilistic. The school is right that this is engineering, and right that the contents of the context determine the quality of the output much more than the model does. The case the school makes is real. Plenty of agent failures are addressable at the prompt-and-context layer. Plenty of teams who think they have a model problem actually have a “did not say what you meant” problem.

The strongest case for putting more in is that under-specified context produces obvious failures. An agent told to write a service method without being told the codebase’s conventions for dependency injection picks the convention from its training data, which was a different convention, and the resulting code doesn’t match the patterns around it. An agent asked to add validation without seeing the validation patterns the codebase already uses invents a third one. Cases like these are common. The fix in each case is information the agent didn’t have. More context, in the right places, prevents the obvious failures.

However.

The school’s premise, when followed without further discipline, is that if you put enough material in, the agent will sort it out. The discipline this post argues for is the inverse. The model doesn’t sort it out. The model does what it always does: produces the most plausible-looking answer it can construct from what is in front of it. If two of the things in front of it disagree, it picks one. If something the agent will need isn’t in front of it, the agent answers from training data, where the answer is always available and frequently wrong for this codebase. The cost of an unrouted question isn’t visible until the wrong answer ships.

A sharp prompt embedded in a working set that has the wrong material, in the wrong order, with no authority resolution, produces the wrong answer with confidence. A merely-adequate prompt in a working set composed with discipline often produces the right one. The prompt is a lever. The working set is the surface the lever operates on.

The composition question isn’t “did I put enough in?” It’s three different questions: did each question this work depends on get routed to its right source, can the agent finish without any lookups, and when sources disagree, which one wins?

Source routing: every question has a right source

Source routing is the discipline of routing each question the agent will encounter to its right source during context composition, before the agent has guessed wrong. It’s not advice to “search more.” It’s a classification system. Some questions belong to the codebase and should be answered there. Some questions belong to the human and should be saved for a real conversation. Some questions belong to a specific upstream artifact (an external service’s source, a database’s schema, a vendor’s documentation) and the right answer can only come from that source. Each question has a right source. The composition step is the place where the routing happens, before any of those questions become assumptions an agent absorbs and continues from.

The default failure mode when routing is absent is that the agent answers from training data. Training data is broad. It represents millions of repositories and covers every common pattern across every popular framework. When the spec says “register the service as a singleton,” the agent reaches into that breadth and chooses an approach. The problem is that breadth and specificity conflict on the details. One codebase’s container uses one registration mechanism, the dominant pattern in training data uses a different one, both compile, and the inconsistency is the damage. The right answer was discoverable from the codebase. The unrouted question went to training instead.

The shape of the discipline is a triage. Three categories cover most of the practical surface.

The first category is “do not ask the human.” Naming conventions, file structure, error-handling patterns, logging format, configuration key conventions, the project’s dependency injection mechanism: all of it is in the codebase. Asking a human to recite it back wastes time, produces an answer that may not match what is committed, and trains the agent to default to humans for things the codebase already encodes. Route to a code search instead, and let the evidence set the confidence: five or more consistent examples means use the pattern, two to four means use it with a flag, and a single example or conflicting examples means escalate.
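Those evidence thresholds are simple enough to write down. What follows is a minimal sketch, assuming a code search has already reduced each matching call site to a label naming the pattern it uses. The names `Decision` and `triage_pattern` are hypothetical, and two choices are assumptions of this sketch rather than part of the rule as stated: zero examples escalates, and any mix of patterns counts as conflicting no matter how lopsided.

```python
from collections import Counter
from enum import Enum


class Decision(Enum):
    USE = "use the pattern"
    USE_WITH_FLAG = "use the pattern, flag for review"
    ESCALATE = "escalate to a human"


def triage_pattern(examples: list[str]) -> Decision:
    """Apply the evidence thresholds to labels found by a code search.

    Each entry in `examples` names the pattern one matching call site uses.
    """
    counts = Counter(examples)
    # Assumption: no examples, or more than one distinct pattern, means the
    # codebase does not give a single answer, so the question escalates.
    if len(counts) != 1:
        return Decision.ESCALATE
    (count,) = counts.values()
    if count >= 5:
        return Decision.USE
    if count >= 2:
        return Decision.USE_WITH_FLAG
    return Decision.ESCALATE
```

The point of encoding it is less the arithmetic than the fact that the escalation path is written into the return type, so "not enough evidence" has somewhere to go other than a guess.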

The second category is “search the codebase for similar implementations.” When the work has analogues already shipped, the right pattern is the one already used three or four times. Pattern reinvention almost always traces to the agent answering from training data when the codebase had a perfectly good pattern one search away.

The third category is “ask the human.” Business decisions about what the system should do. User-facing text, including whether something is user-facing in the first place. Missing requirements no amount of codebase reading would surface. The right source is the engineer or product owner, and routing these to code is its own failure mode (the agent invents a plausible business rule nobody validated).

A fourth category, harder than the first three, covers questions that go to specific upstream artifacts. External service behavior belongs to that service’s source. Database behavior in the presence of triggers belongs to the database’s metadata. Third-party library behavior belongs to the lockfile, the package manifest, and the version of the library’s source those pin. The renamed-config-key failure at the top of this post is a routing failure of this fourth kind. The question had a right source: the lockfile, by construction. The composition step didn’t route the question there. The agent answered it from inference, which is the routing of last resort.
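Routing the version question to the lockfile can be a very small lookup. The sketch below assumes a hypothetical lockfile in the fully pinned `name==version` style that pip requirements files use, with no hash continuation lines and no extras; other ecosystems use different formats and would need their own reader. `pinned_version`, `LOCK`, and the package names are illustrative only.

```python
from typing import Optional


def pinned_version(lock_text: str, package: str) -> Optional[str]:
    """Return the version pinned for `package`, or None if it is not pinned.

    Assumes one `name==version` pin per line. Comments and blank lines
    are skipped. Names are compared case-insensitively with `_` and `-`
    treated as the same character.
    """
    wanted = package.lower().replace("_", "-")
    for raw in lock_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        # Drop environment markers such as `; python_version >= "3.9"`.
        version = version.split(";", 1)[0].strip()
        if name.strip().lower().replace("_", "-") == wanted:
            return version
    return None


LOCK = """\
# hypothetical lockfile contents
examplelib==3.2.0
other-lib==1.4.1  # pinned by the platform team
"""

version_in_scope = pinned_version(LOCK, "examplelib")
```

A `None` result is the useful part of the design: it means the lockfile does not answer the question, which is a reason to escalate, not a reason to fall back on what the library looked like in training data.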

The triage didn’t appear fully formed. Practices like this grow through correction. A developer points out that an agent is asking three questions in a row that the codebase already answers, and the resulting rule is “do not ask the human questions the codebase already answers.” A spec ships with a wrong assertion about an external API, and the resulting rule is “behavioral assertions about external services must be routed to the service’s source.” Each new failure adds a category to the triage.

The discipline question for source routing is: at the moment the spec is being composed, has each question the implementation will encounter been routed to a source you trust? Or are some of those questions left implicit, where the agent will route them to training data by default?
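One way to make that question checkable is to list the questions a spec depends on and record a source next to each. This is a sketch under assumptions, not a description of any existing tool: `Source`, `Question`, `unrouted`, and the sample questions are all hypothetical.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class Source(Enum):
    CODEBASE = "search the codebase"
    HUMAN = "ask the human"
    UPSTREAM = "read the upstream artifact"


@dataclass(frozen=True)
class Question:
    text: str
    source: Optional[Source] = None  # None means nobody routed it


def unrouted(questions: list[Question]) -> list[str]:
    """Return the questions that would fall through to training data."""
    return [q.text for q in questions if q.source is None]


spec_questions = [
    Question("How are services registered in the container?", Source.CODEBASE),
    Question("What should the error message say to the user?", Source.HUMAN),
    Question("Which major version of the config library is in scope?"),
]

left_implicit = unrouted(spec_questions)
```

Here the third question has no source, so it is exactly the kind the agent would answer by default from training data. A non-empty `left_implicit` is a reason to stop composing and route before delegating.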

Self-containment: the silent assumption substitution

The second pattern is self-containment: a context is self-contained if the agent can complete its task without going to look up things that should have been included up front. The cost of breaking self-containment isn’t the lookup itself. The cost is the silent assumption substitution that happens when the lookup fails or the agent decides not to bother.

The mechanic is recognizable. A spec describes a database update. The spec lists the table and the columns being changed. The spec doesn’t mention that the table has an update trigger that rolls back any UPDATE that omits a particular timestamp column. The trigger is invisible from the spec. The agent could discover it from a database metadata query. In practice the agent doesn’t, because nothing in the spec said to. The agent writes the UPDATE without the column. The unit tests pass, because the unit tests mock the database. Integration tests fail, because the trigger is real. The failure traces back to a piece of context that should have been in the spec and wasn’t.
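The mechanic can be reproduced in a few lines with SQLite from the Python standard library. This is an illustration, not the database from the story: the `orders` table, the `modified_at` column, and the trigger name are hypothetical, and the trigger approximates "omits the timestamp column" by rejecting any UPDATE that leaves the timestamp unchanged. The metadata query against `sqlite_master` is the lookup the spec never told the agent to make.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE orders (id INTEGER PRIMARY KEY, status TEXT, modified_at TEXT);
    INSERT INTO orders VALUES (1, 'new', '2024-01-01T00:00:00');

    -- Rejects any UPDATE that leaves modified_at unchanged.
    CREATE TRIGGER orders_require_modified_at
    BEFORE UPDATE ON orders
    WHEN NEW.modified_at IS OLD.modified_at
    BEGIN
        SELECT RAISE(ABORT, 'modified_at must be set on every update');
    END;
    """
)


def triggers_on(connection: sqlite3.Connection, table: str) -> list:
    """List (name, sql) for every trigger defined on `table`."""
    rows = connection.execute(
        "SELECT name, sql FROM sqlite_master "
        "WHERE type = 'trigger' AND tbl_name = ?",
        (table,),
    )
    return rows.fetchall()


# The UPDATE the spec implies: table and changed column only.
try:
    conn.execute("UPDATE orders SET status = 'shipped' WHERE id = 1")
    rejected = False
except sqlite3.DatabaseError:
    rejected = True

# What a self-contained spec would have surfaced before any code was written.
trigger_names = [name for name, _sql in triggers_on(conn, "orders")]

# The UPDATE that satisfies the trigger.
conn.execute(
    "UPDATE orders SET status = 'shipped', modified_at = '2024-01-02T00:00:00' "
    "WHERE id = 1"
)
```

A unit test that mocks the connection never executes the trigger, which is why the first UPDATE looks fine until it meets a real database.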

What makes this damaging is the assumption that an agent will pause the way a human does. Humans, when something doesn’t make sense, pause and ask. That pause is a hidden checkpoint that compensates for skipped composition discipline upstream: the human implementer catches the missing column because it doesn’t match the schema they remember from last quarter. Agents don’t pause. They fill the gap with the most plausible-looking answer available and continue, with no flag that the answer was generated rather than read.

The practical test for self-containment is structural: a handoff spec is self-contained if the implementation agent can complete the work using only that spec, without needing the parent story, the conversation that produced the work, or any external documentation the spec references but doesn’t include. A spec that says “follow the patterns in the X module” has externalized the patterns and created a lookup the agent may not reliably perform. A spec that says “the patterns to follow are these” with the relevant excerpts inline eliminates the lookup.
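The structural test lends itself to a mechanical check if the spec records what it points at separately from what it includes. A minimal sketch, assuming a hypothetical `HandoffSpec` shape in which every referenced artifact is listed by name and inlined excerpts are keyed by the same names:

```python
from dataclasses import dataclass, field


@dataclass
class HandoffSpec:
    body: str
    # Names of artifacts the spec points at.
    references: set[str] = field(default_factory=set)
    # Artifact name -> excerpt included inline in the handoff.
    excerpts: dict[str, str] = field(default_factory=dict)


def externalized(spec: HandoffSpec) -> list[str]:
    """Artifacts the spec references but does not include."""
    return sorted(
        ref for ref in spec.references
        if not spec.excerpts.get(ref, "").strip()
    )


spec = HandoffSpec(
    body="Add validation to the order endpoint.",
    references={"X module patterns", "parent story"},
    excerpts={"X module patterns": "validators live next to the handler ..."},
)

lookups_left_to_the_agent = externalized(spec)
```

Anything in `lookups_left_to_the_agent` is a lookup the agent may or may not perform. The check cannot tell whether an excerpt is the right excerpt; it only catches the reference with nothing behind it.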

There’s a more concrete artifact worth describing: a hand-written document produced between two implementation sessions, specifically to compose what the next session would start with. A session ended. A new session was about to begin on the same work, several hours of context apart. Instead of letting the new session start from the original spec alone, an engineer wrote a short document for the next session that named which artifact to read first and why, gave a reference table of supporting artifacts with “when to consult” guidance for each, and listed the first six actions in order. The document also named, in plain words, the failure mode it was preventing: skipping the document and re-deriving everything from the codebase would work, and would also waste hours and risk fabricating expected values that contradicted what the previous session had already proven. That last sentence is what self-containment looks like as a discipline written down by someone who had paid the cost of its absence.

The discipline question for self-containment is: if the implementation agent had only this spec and the codebase, with no access to the conversation that produced the spec, would it be able to finish the work without substituting an answer to a question the spec didn’t address?

Authority: when sources disagree, which one wins

The third pattern is authority: the explicit rule that resolves which source wins when sources in a context disagree. Most contexts don’t state one, so the agent picks a source without telling you. Source disagreement is the default state of any non-trivial context (a codebase pattern vs. a training-data pattern vs. something the spec mentions in passing), and in the absence of an authority rule the agent picks the more salient signal. Whichever way the salience falls, the choice is invisible. The output uses one source’s answer. The other isn’t flagged as having been overruled.

The fix is to make authority explicit during composition. State which source wins for which kind of question, before the agent has to choose. The codebase wins over training data for implementation patterns in this codebase. The external service’s source wins over inference for that service’s behavior. The story spec wins over the implementer’s prior assumptions for what the work is. The team’s coding standards win over the model’s defaults for naming, formatting, and structural conventions. Each of these is one rule. A small set of them, stated up front, removes most of the silent disagreements.

The format that has held up in practice is a precedence list. The first source on the list wins where it speaks. The next source wins where the first doesn’t speak. The list ends at the level beyond which the agent should escalate to a human rather than keep guessing. A four-level list is usually enough. The list’s value isn’t in being exhaustive but in being explicit. An agent operating under a precedence list doesn’t have to invent the resolution rule on each conflict.
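The precedence list described above can be sketched as an ordered list of lookups that ends in escalation. The ordering, the source names, and the sample answers below are hypothetical; which source sits at which level is a decision for the team composing the context, and `resolve`, `Lookup`, and `Escalate` are names invented for this sketch.

```python
from typing import Callable, Optional

# A lookup either answers a question or returns None when it does not speak.
Lookup = Callable[[str], Optional[str]]


class Escalate(Exception):
    """Raised when no source on the list speaks to the question."""


def resolve(question: str, precedence: list[tuple[str, Lookup]]) -> tuple[str, str]:
    """Return (source_name, answer) from the first source that speaks."""
    for name, lookup in precedence:
        answer = lookup(question)
        if answer is not None:
            return name, answer
    # The list ends here: stop and surface instead of guessing.
    raise Escalate(question)


# Hypothetical sources, modelled as plain dictionaries.
story_spec = {"retry count": "3"}
codebase = {"retry count": "5", "logger name": "app.orders"}
standards = {"logger name": "orders", "line length": "100"}

PRECEDENCE = [
    ("story spec", story_spec.get),
    ("codebase", codebase.get),
    ("team standards", standards.get),
]

winner = resolve("retry count", PRECEDENCE)
```

Two sources disagree about the retry count, and the list settles it without anyone rereading both. Returning the source name alongside the answer keeps the choice visible, which is the property a silent pick lacks.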

The discipline question for authority is: when two sources in your composed context disagree, which one wins? If the answer is “I would have to read both and decide,” the agent faces the same conflict without the judgment to read both and decide.

Composition is upstream of the model

Source routing, self-containment, and authority are three views of the same activity. All three are decisions made at the moment context is being composed, before any model touches it. All three address failure modes that the model can’t recover from on its own, because the model can’t see the question that routing would have answered, the missing piece that self-containment would have included, or the rule that authority would have encoded. The composition step is upstream of every other lever in the system.

This reframes a lot of conversation about AI-assisted development. When a session goes wrong and the team’s first instinct is to look at the model layer (was it the wrong prompt, the wrong tool harness, the wrong sampling settings?), the composition step usually sat unexamined. The first hypothesis worth testing is whether the context the agent was working from carried the right material, whether the agent could finish without lookups, and whether the rules for source disagreement were explicit. Most of the time, one of the three was missing.

It’s also why the same model, in my experience, produces noticeably different output quality in different teams’ hands. The model is the same. What’s in front of the model isn’t. A team that has invested in routing each anticipated question to its right source, in making each handoff self-contained, and in stating which source wins when sources disagree, is feeding the model a sharper artifact than a team that hasn’t. The output gap isn’t a model difference. It’s a composition difference, and composition is where engineering effort has the highest leverage in this stack.

This applies with particular force in multi-agent architectures. Under discrete delegation (the design where an orchestrator hands a task to a sub-agent that works in its own context and returns only a result), composition is the only engineering act the orchestrator can perform to shape that sub-agent’s work. The orchestrator doesn’t get to observe the sub-agent’s intermediate state and steer mid-task. The artifact it composed up front is the entire surface the sub-agent operates on. Without composition discipline, delegation is asking a different agent the same badly-composed question and getting a different shade of the same wrong answer.

A lot of work labeled “prompt engineering” or “model handling” is composition work mislabeled. The fix for an agent that picks the wrong overload is rarely a better prompt. It’s usually a better-routed source. The fix for an agent that invents a constraint is rarely temperature tuning. It’s usually a self-contained spec. The fix for an agent that produces inconsistent style is rarely a model-layer adjustment. It’s usually an authority rule. The model lever has less travel than the composition lever, and most of the leverage on the composition lever is unused.

What you cannot compose your way out of

The composition discipline isn’t a complete answer to agent reliability. There’s a class of question no source available at composition time can answer. The spec writer doesn’t know, the codebase doesn’t encode it, no upstream artifact contains it, and the human hasn’t decided yet. The right move is for the agent to flag the question rather than substitute an answer, and that requires the agent to recognize when it’s in that situation. Detection of “this question has no good source” is a hard problem I don’t have a stable answer for. The mitigation in practice is to make the precedence list end at “stop and surface” rather than “decide somehow,” so the agent defaults to escalation when no source carries authority. That’s not a fix. It’s the least bad partial answer I have found, and the place where the composition-discipline argument runs out.

The other thing composition discipline can’t do is keep itself up to date. Every routing rule, every self-containment requirement, every authority precedence is a snapshot of what the team learned from the failures it has seen so far. New failure modes produce new categories. The rules grow. They also, if no one prunes them, accumulate noise the way any rules system does. This is a scaled-up member of the pollution family the prior post in this series named: the class of failures where a working surface accumulates noise faster than signal, degrading not from capacity but from debris. Keeping the composition rules sharp is its own discipline, and the cost of letting them drift is that they slowly stop binding what they were written to bind.

Two clarifications, before the closing check. This isn’t a claim that the prompt doesn’t matter. The prompt is part of the working set, and a sharp one composed against a well-assembled set produces sharper output than a vague one against the same set. The argument is that the prompt is a small part of the surface, and tuning it without composing the rest is solving the cheap problem and leaving the expensive one in place. This also isn’t a claim that composition is a finished discipline. The patterns are the discipline. The practice of composing well for a particular team and codebase develops over time, and the threshold for what counts as enough source routing moves as the work changes.

Diagnosing a composition failure: three questions

The three patterns above (source routing, self-containment, authority) collapse into a diagnostic the next time an agent session goes wrong. Pick the most recent agent session whose output you had to redo. Not a session where agents were the wrong tool for the task. A session where agent delegation was the right approach and the answer came out subtly off in a way that took a careful read to catch. Trace the failure back to the moment the agent committed to the wrong move. Then ask three questions about the context the agent had at that moment.

First: was there a specific question the agent had to answer to make this move, and was that question routed to its right source in the context, or did the agent answer it from training data because the routing was implicit?

Second: did the agent have to do a lookup mid-task that the context could have included up front, and was the answer it ended up with a real lookup or a substituted plausible-looking answer?

Third: were any two sources in the context in conflict at the moment of decision, and was the precedence between them stated, or did the agent pick one without flagging that it had picked?

If any of the three answers points at a missing piece of composition discipline, the model isn’t the lever. The composition step is. The model will keep generating plausible-looking answers from whatever you give it. Whether those answers are right is mostly determined upstream of the model, by the artifact someone composed before the agent ever ran.