Multi-Agent Orchestration: Boundary Contracts That Hold

An orchestrator spawns three sub-agents in parallel to look at three different angles of a problem. The first sub-agent comes back in twelve seconds with a four-line answer. The second sub-agent comes back in two minutes with a fourteen-paragraph essay. The third sub-agent comes back in three minutes with what reads like a small consultancy deliverable: an executive summary, a methods section, six findings, three recommendations, an appendix. The orchestrator was supposed to consume all three in one pass and decide which direction to take. It now has three working surfaces of wildly different shape stacked into its context, and the sub-agent that wrote four lines is the one whose answer the orchestrator can act on. The other two have buried their answers inside the bookkeeping of how they got there. The orchestrator’s next decision isn’t made from a sharper picture. It’s made from a noisier one. The parallel spawn was supposed to be a multiplier and instead it’s a pollution event.

This is the failure shape multi-agent orchestration produces when nobody designed the boundary. Once more than one agent is in a pipeline, a class of design tradeoffs appears that doesn’t exist in single-agent flows: who composes context for whom, what crosses the boundary, what each side is allowed to assume about the other, where the pipeline pauses and where it chains. The load-bearing engineering act is discipline at the boundaries between agents. Standardize the shape of what comes in and what goes out. Calibrate the verbosity. Leave the agent free in how the work gets done inside the boundary, but fix what crosses it.

The agent-team school, in its strongest form

The popular school treats the multi-agent question as a coordination problem and answers it with a coordination architecture. AutoGen, CrewAI, and LangGraph are the most visible expressions of the school in current practice. The framing is recognizable across all three. Agents are specialized roles: planner, researcher, critic, executor. They communicate through shared message buses, shared state, or graph-edge transitions. Emergent behavior comes from the conversation between them. The orchestrator is itself an agent that mediates the others. AutoGen describes its AgentChat as “a programming framework for building conversational single and multi-agent applications,” with each agent broadcasting its response to all other agents to maintain a consistent shared context. CrewAI frames agents as employees: “agents communicate and collaborate like human team members.” The whole stack is built around the idea that intelligence at the team level emerges from the conversation among the participants, the same way it does in a well-running engineering team.

The case for this school isn’t weak. There are real problems that decompose naturally into specialized roles, where the work of one role genuinely informs the work of the next, where letting agents converse with each other surfaces information no single agent could produce alone. A planner agent that proposes an approach, a critic agent that argues against it, a synthesizer agent that resolves the disagreement is a recognizable pattern from human engineering practice. Codifying it as a multi-agent architecture isn’t a category error. It’s a thoughtful response to a real coordination question. The school is also right that single-agent flows hit a ceiling.

The school is already evolving away from its purest form. AutoGen’s successor, the Microsoft Agent Framework, replaced GroupChat with explicit graph-based workflows. LangGraph’s shared-state model isn’t free-form chat. Agents read and write to a shared state object rather than passing messages directly to each other, with checkpointing at every node. The most rigorous current implementations have been pulled, by production pressure, toward more explicit coordination surfaces.

The school’s premise is that the boundary between agents should look as much as possible like the boundary between collaborating humans, with conversation as the medium of coordination, the way two engineers exchange in a pairing session. The premise the rest of this post argues is the inverse. The boundary between agents should look as little as possible like the boundary between collaborating humans. The agents aren’t pairing. They are processes. The medium that scales isn’t conversation. It is contract: a tightly specified handoff where each side knows exactly what to deliver and exactly what to consume, and neither side observes the other’s process. Kent Beck has put the cleanest version of the underlying point: “Multi-agent is a feature. Outcome-orientation is the thing the feature is supposed to deliver. We keep getting those confused.” The popular school has confused the feature for the thing.

The failure shape isn’t specific to parallel spawns. The same problem appears in chained pipelines. A pipeline runs three agents in sequence: a researcher, a writer, a reviewer. The researcher returns a thorough report. The writer reads the report and produces a draft. The reviewer reads the draft and the report and produces feedback. The orchestrator reads everything along the way, because everything is in the conversation by the time the next step runs. By the third agent, the orchestrator’s working surface holds the original task, the researcher’s eight-hundred-word findings, the writer’s two-thousand-word draft, the reviewer’s notes, and the residue of every status update each agent volunteered as it worked. Each agent behaved correctly inside its own boundary. The pipeline degrades anyway, because nobody designed what would cross the boundaries between them.

Trust-based delegation: the mental model

The frame that holds the rest of the post together, before any specific pattern lands, is trust-based delegation: the orchestrator’s commitment that the rules and instructions it crafted for the sub-agent are good enough that the agent will return the correct result, removing the need to observe every step. The orchestrator isn’t a manager looking over the sub-agent’s shoulder. It’s not a critic reviewing each thought. It’s a caller of a function whose contract it trusts, the way any function caller trusts the function it called.

The mental picture is the same one a senior engineer uses with a smart colleague: hand them a specific task with the context they need, walk away, read their report when they come back. You don’t stand behind them watching them type. You did the work that earned the right not to. The discipline is the same: pay the cost in the brief, leave the run alone.

The shift the model forces is unfamiliar to engineers used to multi-agent demos where the running output is a stream of agent thoughts. The orchestrator doesn’t see most of what the sub-agent does, and doesn’t need to. It sees what the boundary contract said it would see: the result, in the shape it was promised, at the verbosity it was told to honor.

Discrete delegation, restated for this post: a multi-agent design where the orchestrator hands a task to a sub-agent, the sub-agent works in its own context, and the sub-agent returns only the result. The orchestrator doesn’t observe or accumulate the sub-agent’s intermediate state. Every pattern that follows assumes that frame. The default architecture for known-shape production work is isolation, not collaboration. Each agent receives only its task, its context, and its return contract: no other agents’ specs, no shared back-channel, no access to the orchestrator’s accumulated state. The isolation is the design choice that lets the trust model hold. If the agents could talk to each other freely during work, the orchestrator couldn’t trust any single agent’s return, because the return would now reflect not just the spec the orchestrator gave it but also whatever residue the conversation with the other agents had left behind. Trust requires bounded inputs.
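A minimal sketch of that frame, with the model call replaced by a stand-in function. The names `Orchestrator`, `delegate`, and `fake_sub_agent` are illustrative assumptions, not any framework's API. The point it shows is narrow: the sub-agent receives only its spec, its scratchpad stays inside its own call, and the orchestrator's working surface grows by exactly one result.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Orchestrator:
    # The orchestrator's working surface: only the results it got back.
    context: list[str] = field(default_factory=list)

    def delegate(self, task_spec: str, run_sub_agent: Callable[[str], str]) -> str:
        # The sub-agent is handed its spec and nothing else: it never sees
        # self.context, and the orchestrator never sees its intermediate state.
        result = run_sub_agent(task_spec)
        self.context.append(result)
        return result


def fake_sub_agent(task_spec: str) -> str:
    # Hypothetical stand-in for a model call. The scratchpad is local to this
    # function and is discarded when the function returns.
    scratchpad = [f"reading spec: {task_spec}", "searching...", "rejecting candidate A"]
    scratchpad.append("settled on candidate B")
    return "status: pass; recommendation: use candidate B"


orchestrator = Orchestrator()
result = orchestrator.delegate(
    "Pick a retry library. Return status and one recommendation.",
    fake_sub_agent,
)
```

After the call, `orchestrator.context` holds the single returned string. None of the scratchpad lines reach it, which is the property the trust model depends on.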

Anthropic’s guidance on building effective agents names this pattern as orchestrator-workers. What follows is what the discipline looks like at production scale.

Boundary contracts: shape locked, verbosity calibrated, content open within

The first pattern is the boundary contract: the agreed-on shape of what crosses an agent boundary, with three components. Shape locked, so the orchestrator can consume the result predictably. Verbosity calibrated, so the result is signal and not transcript. Content open within those constraints, so the agent can include relevant findings the orchestrator didn’t think to ask about.

A return contract written into a task spec looks something like this:

Return contract:
- status: pass | blocked | fail
- findings: max 3, each with file path and reason
- recommendation: one sentence
- open: optional, max 75 words for anything notable not in scope
- Do not include search process or rejected alternatives.

The shape side is the easy half. An orchestrator that consumes structured findings (“status, three highest-impact items, file paths, recommendation”) can act on them mechanically. The same orchestrator consuming a free-form essay has to parse the essay into the same fields itself, on every invocation, and the parsing is a place where the orchestrator’s context fills with the wrong material. Shape recovery in the orchestrator is the multi-agent equivalent of parsing a CSV with no agreed delimiters. It works exactly until it doesn’t, and the failure is silent until something downstream chokes.
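One way the shape half can be enforced mechanically is sketched below. It assumes the sub-agent is asked to emit the return contract as JSON, which the contract above does not require; the field names mirror the contract, while `parse_return`, `AgentReturn`, and `ContractViolation` are hypothetical names introduced for illustration. Only shape is checked here, not verbosity.

```python
import json
from dataclasses import dataclass

ALLOWED_STATUS = {"pass", "blocked", "fail"}
ALLOWED_FIELDS = {"status", "findings", "recommendation", "open"}
MAX_FINDINGS = 3


class ContractViolation(ValueError):
    """Raised when a sub-agent return does not match the agreed shape."""


@dataclass(frozen=True)
class Finding:
    path: str
    reason: str


@dataclass(frozen=True)
class AgentReturn:
    status: str
    findings: tuple[Finding, ...]
    recommendation: str
    open: str = ""


def parse_return(raw: str) -> AgentReturn:
    """Parse a sub-agent return, rejecting anything outside the contract."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ContractViolation(f"return is not JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ContractViolation("top level must be an object")
    unknown = set(data) - ALLOWED_FIELDS
    if unknown:
        raise ContractViolation(f"unexpected fields: {sorted(unknown)}")

    status = data.get("status")
    if not isinstance(status, str) or status not in ALLOWED_STATUS:
        raise ContractViolation(f"status must be one of {sorted(ALLOWED_STATUS)}")

    raw_findings = data.get("findings", [])
    if not isinstance(raw_findings, list) or len(raw_findings) > MAX_FINDINGS:
        raise ContractViolation(f"findings must be a list of at most {MAX_FINDINGS}")
    findings = []
    for item in raw_findings:
        if not isinstance(item, dict) or not item.get("path") or not item.get("reason"):
            raise ContractViolation("each finding needs a path and a reason")
        findings.append(Finding(path=str(item["path"]), reason=str(item["reason"])))

    recommendation = data.get("recommendation")
    if not isinstance(recommendation, str) or not recommendation.strip():
        raise ContractViolation("recommendation is required")
    open_section = data.get("open", "")
    if not isinstance(open_section, str):
        raise ContractViolation("open must be text")
    return AgentReturn(status, tuple(findings), recommendation.strip(), open_section.strip())
```

A free-form essay fails at the first step instead of being silently reinterpreted, so the shape failure surfaces at the boundary rather than somewhere downstream.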

The verbosity side is the half that almost everyone underspecifies. Verbosity calibration is the explicit specification of what level of detail counts as the signal the orchestrator needs versus what counts as raw process noise that stays in the sub-agent’s workspace. The default an unspecified agent picks is too long, by a lot. A research task without a verbosity cap returns a six-hundred-word essay describing what was searched, why each result was relevant, and a narrative summary of the findings. The same task with the cap “under two hundred words, three findings, no preamble” returns under two hundred words, three findings, no preamble. The cap isn’t formatting fussiness. It’s the discipline that keeps the result from being a small transcript in disguise.

The third component is the part that gets lost in templates. Within the locked shape and the calibrated verbosity, the agent is free to include findings the orchestrator didn’t enumerate in the spec. A reviewer that finds a security issue the spec didn’t ask about should report it. A researcher that surfaces a contradicting source should flag it. A boundary contract that locks the shape into a fixed-field schema with no room for additional findings punishes the agent for noticing things, which is exactly what you want sub-agents doing. The fields are fixed, but there’s room for an open section. The verbosity cap covers the open section too, so the agent can’t smuggle a transcript in through it.
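The verbosity half can be checked in the same spirit. The sketch below is an assumption-laden illustration: the 200-word total comes from the example cap above, the 75-word open-section cap comes from the sample contract, and the sentence counter is a crude punctuation heuristic rather than real sentence segmentation. The total is computed across every field, so the open section counts against the same budget as everything else.

```python
import re


def word_count(text: str) -> int:
    return len(text.split())


def sentence_count(text: str) -> int:
    # Crude heuristic: split on terminal punctuation followed by whitespace
    # or end of text. Good enough to catch a paragraph posing as a sentence.
    parts = re.split(r"[.!?]+(?:\s+|$)", text.strip())
    return len([part for part in parts if part])


def verbosity_violations(
    recommendation: str,
    finding_reasons: list[str],
    open_section: str = "",
    total_cap: int = 200,
    open_cap: int = 75,
) -> list[str]:
    """Return a list of human-readable cap violations (empty means compliant)."""
    problems = []
    if sentence_count(recommendation) != 1:
        problems.append("recommendation must be exactly one sentence")
    open_words = word_count(open_section)
    if open_words > open_cap:
        problems.append(f"open section is {open_words} words, cap is {open_cap}")
    # The total covers the open section too, so it cannot carry a transcript.
    total = (
        word_count(recommendation)
        + open_words
        + sum(word_count(reason) for reason in finding_reasons)
    )
    if total > total_cap:
        problems.append(f"return is {total} words, cap is {total_cap}")
    return problems
```

The function reports every violation rather than stopping at the first, which keeps the check usable both as a gate and as a calibration aid while the caps are still being tuned.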

There’s a fourth flavor of the contract that gets less attention and is worth pulling out. The contract governs what arrives at the sub-agent, not just what the sub-agent returns. Consider a review agent whose job is to compare two drafts blindly, judging only on the texts. The orchestrator’s responsibility under that contract is more than passing the paths. The orchestrator must strip the drafts of any banner text that would reveal which is which, randomize the order in which they are presented, and translate the agent’s verdict back to real identities after the fact. The contract isn’t just “return findings in this shape.” It’s also “you will receive only the two texts, with provenance markers removed, with the order randomized; if you receive any other parameters, ignore them as orchestration noise.” The producer-facing side of the contract is what makes the sub-agent’s blindness a feature instead of a hopeful constraint the agent might or might not honor. Shape and verbosity are the consumer’s constraints. Sanitized inputs and content openness are the producer’s.
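The input side of the blind-review contract might look like the sketch below. The provenance-marker pattern (lines starting with `Draft:`, `Author:`, or `Source:`) is a made-up convention for illustration, and `prepare_blind_pair` and `unblind` are hypothetical helper names. What it demonstrates is the three orchestrator duties named above: strip, randomize, and translate back.

```python
import random
import re

# Hypothetical banner convention; a real pipeline would match whatever
# markers its own drafts actually carry.
PROVENANCE_LINE = re.compile(r"^\s*(draft|author|source)\s*:", re.IGNORECASE)


def strip_provenance(text: str) -> str:
    """Drop any line that would reveal which draft is which."""
    kept = [line for line in text.splitlines() if not PROVENANCE_LINE.match(line)]
    return "\n".join(kept).strip()


def prepare_blind_pair(
    drafts: dict[str, str], rng: random.Random
) -> tuple[dict[str, str], dict[str, str]]:
    """Return (payload for the sub-agent, private key mapping label to identity)."""
    if len(drafts) != 2:
        raise ValueError("blind comparison expects exactly two drafts")
    identities = list(drafts)
    rng.shuffle(identities)  # presentation order carries no signal
    labels = ["A", "B"]
    payload = {
        label: strip_provenance(drafts[identity])
        for label, identity in zip(labels, identities)
    }
    key = dict(zip(labels, identities))  # stays with the orchestrator
    return payload, key


def unblind(verdict_label: str, key: dict[str, str]) -> str:
    """Translate the sub-agent's verdict ('A' or 'B') back to a real identity."""
    return key[verdict_label]
```

Only `payload` crosses the boundary. The `key` never leaves the orchestrator, so the sub-agent's blindness is a property of what it was given rather than an instruction it was asked to follow.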

The discipline question for boundary contracts is: at every place in your pipeline where an agent hands work to another agent or back to the orchestrator, is the shape of what crosses written down, is the verbosity capped, are the inputs deliberately constructed, and does the agent know what is allowed to come along beyond the fixed fields?

Controlled context distribution: who carries what, and at what cost

The second pattern, controlled context distribution, covers both directions of the agent boundary: what the orchestrator sends out and what comes back. On the outbound side, the core move is pre-loading: paying context-assembly cost once at the orchestrator and embedding the result in each spawned sub-agent, instead of having every sub-agent rediscover or reread the same material. The cost of leaving that distribution unmanaged runs in three directions, each compounding with the number of agents involved.

Picture a coordinator that launches four parallel reviewers, each reviewing the same batch of source documents from a different angle. The naive setup gives each reviewer the file paths and lets it open the files itself. Four agents each independently read the same documents. The reads are redundant. They’re also not free: each one pays the file-open cost, the parse cost, the model’s attention cost, and (less obviously) the cost of the four agents’ contexts each holding their own slightly different version of the material at the moment they each happened to open it. Models don’t deterministically pick the same six files when asked, so the slight differences seed slight inconsistencies in what each sub-agent treats as ground truth. The pipeline does N times the orientation work and gets N slightly-different orientations for it.

Pre-loading inverts the pattern. The orchestrator pays the orientation cost once: it reads the files, runs the searches, fetches the doc, and composes a brief that captures what every sub-agent will need. Then it spawns the sub-agents and embeds the brief in each task spec. The sub-agents don’t orient. They start with the orientation already done, in identical form across all of them, and spend their context budget on the slice of work that is uniquely theirs. The savings aren’t just inference cost, though those matter. The deeper savings are in consistency: every sub-agent operates from the same composed context, so the work they return composes back into a coherent picture rather than N pictures with subtly different foundations.

The shape that has worked best is a small, dense brief: the names of the systems involved, the version of any external dependency in scope, the conventions the work must respect, the current state of any in-flight decisions the sub-agents need to honor. Three or four short paragraphs. Sometimes a precedence list. Almost never a transcript of how the orchestrator arrived at the brief. The brief is the cleaned-up output of the orchestrator’s orientation, not its scratchpad.
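As a rough illustration of that brief, the sketch below models it as a small record with the four kinds of content listed above and renders it once before embedding it in every task spec. The class and function names, and the example systems and versions in the usage, are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Brief:
    systems: list[str]
    dependency_versions: dict[str, str]
    conventions: list[str]
    in_flight_decisions: list[str]

    def render(self) -> str:
        deps = ", ".join(
            f"{name} {version}" for name, version in self.dependency_versions.items()
        )
        lines = [
            "Systems: " + ", ".join(self.systems),
            "Dependencies in scope: " + deps,
            "Conventions: " + "; ".join(self.conventions),
            "In-flight decisions to honor: " + "; ".join(self.in_flight_decisions),
        ]
        return "\n".join(lines)


def build_specs(brief: Brief, slices: dict[str, str]) -> dict[str, str]:
    """Embed the same rendered brief in every sub-agent's task spec."""
    rendered = brief.render()  # rendered once, so every spec carries identical text
    return {name: f"{rendered}\n\nYour task: {task}" for name, task in slices.items()}


# Hypothetical example values.
brief = Brief(
    systems=["billing-api", "invoice-worker"],
    dependency_versions={"postgres": "15", "payments-sdk": "4.2"},
    conventions=["no schema changes without a migration file"],
    in_flight_decisions=["legacy invoice table is read-only until cutover"],
)
specs = build_specs(
    brief,
    {"security": "Review auth paths.", "performance": "Review query plans."},
)
```

The record has no field for how the orchestrator arrived at the brief, which is deliberate: there is nowhere for the scratchpad to go.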

The wrinkle is the threshold. Pre-loading is correct only when the cost of duplicating the material into every spawn prompt is less than the cost of the agents reading independently. Past a certain size of source material, the duplication cost takes over: four agents each receiving a hundred kilobytes of pre-loaded content in their spawn prompt carry more total context than four agents that read the same material themselves and keep only the relevant extract. The shape of the rule is a threshold: pre-load when total source content fits under some upper bound, let agents read independently when it doesn’t. The exact threshold is a calibration. The structure of the decision is the durable thing: a measurable property of the material drives an explicit, revisable choice about whether to pre-load.

There’s an inverted version of the same pattern that is worth naming because it shows up under the same conditions and is easy to confuse. Sometimes the orchestrator’s working surface is what needs to stay clean, not the sub-agents’. In that case, the right move is the opposite of pre-loading: the orchestrator assembles nothing, hands the sub-agent a pointer to where the material lives, and the sub-agent does the entire context assembly inside its own session. The orchestrator’s context never sees the source, never sees the assembly, and never sees the intermediate steps. Only the final output returns. This is the right move when the material is large or unique to one task: the orchestrator pays zero context cost, the sub-agent’s context fills with the material it needs, and the assembly debris is discarded with the sub-agent’s session at the end. When N agents need the same material, embed it once at the orchestrator and distribute. When one agent needs material the orchestrator doesn’t, don’t assemble it at the orchestrator at all. Let the sub-agent do its own loading.
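The decision structure across pre-loading, independent reads, and pointer-only delegation can be sketched as a single function. The 32,000-byte default is a placeholder, not a recommended value; the threshold is a calibration that has to come from the pipeline itself, and the names here are illustrative.

```python
from enum import Enum


class Loading(Enum):
    PRELOAD = "preload"          # orchestrator reads once, embeds in every spec
    AGENTS_READ = "agents_read"  # shared but too large; each agent reads for itself
    POINTER = "pointer"          # one consumer; orchestrator passes a pointer only


def choose_loading(
    source_bytes: int,
    consumers: int,
    orchestrator_needs_it: bool,
    preload_limit_bytes: int = 32_000,  # placeholder; calibrate per pipeline
) -> Loading:
    """Pick where the loading work happens, based on measurable properties."""
    if consumers <= 1 and not orchestrator_needs_it:
        # Material unique to one task: keep it out of the orchestrator entirely.
        return Loading.POINTER
    if source_bytes <= preload_limit_bytes:
        return Loading.PRELOAD
    return Loading.AGENTS_READ
```

For example, `choose_loading(8_000, 4, False)` selects pre-loading, while `choose_loading(100_000, 4, False)` falls back to independent reads. Keeping the threshold as an explicit parameter is what makes the choice revisable later.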

When context distribution is left unmanaged on the return side, three costs appear. The first is waste: every agent independently holding the same material pays its own context budget for it, and the orchestrator pays again. The waste is real but bounded. The second is drift: the same material, read at different moments by different agents, can come back as different things. Multi-agent pipelines without controlled context distribution accumulate drift the way long sessions accumulate pollution: the same material exists in N places, and N places means N versions over time. The drift is silent and unbounded. The third is decisional drag: the orchestrator has the correct answer somewhere in its working set but loses it to material that is more salient because it appeared more recently, was repeated across multiple sub-agent reports, or was the dominant pattern in the most recent batch of returns. The pollution-family lens applies at the orchestration scale: the orchestrator’s working surface accumulates noise faster than signal, and decisions made from the polluted surface degrade the same way they would in a single long session, just with the additional vector that each sub-agent is contributing to the noise rate. Decisional drag is the cost that survives even a tight boundary contract. Even when each sub-agent returns a small disciplined result, N small disciplined results can still tip the orchestrator’s working surface against the answer that was already there.

There’s a counterintuitive corollary about controlled duplication that is worth landing. When the alternative is each agent independently looking up the same material at runtime (with the failure modes that produces: lookups that fail, lookups the agent skips, lookups that succeed but return a different version than the spec assumed), deliberate duplication into every agent’s spec is the right design, not a tradeoff. If N sub-agents in N different repositories all need the exact same toggle key and behavior, the spec for each agent should contain the full toggle specification, word for word, rather than referencing a shared document any of the agents would have to look up. The duplication is correct compilation from a single source of truth. Controlled duplication through the orchestrator is cheaper than uncontrolled parallel discovery, every time. Duplication at runtime, where each agent independently re-derives the same material in its own session, is the failure mode. Duplication at composition time, where the orchestrator deliberately writes the same material into every spec that needs it, is the correct response. The two look superficially similar and produce opposite outcomes. The difference is who paid the duplication cost.
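A small sketch of duplication at composition time, assuming a made-up toggle (`checkout.new_tax_engine`) and hypothetical repository names. The shared block lives in one place in the orchestrator, is written word for word into each spec, and a check confirms no spec has drifted from the source before anything is spawned.

```python
# Single source of truth, held by the orchestrator. The toggle is hypothetical.
TOGGLE_SPEC = (
    "Toggle key: checkout.new_tax_engine\n"
    "Default: off\n"
    "When on: route tax calculation through the new engine; "
    "fall back to legacy on error."
)
MARKER = "--- Shared toggle specification (verbatim, do not look up elsewhere) ---"


def compile_specs(shared_block: str, repo_tasks: dict[str, str]) -> dict[str, str]:
    """Write the same shared block, word for word, into every repo's spec."""
    return {
        repo: f"{task}\n\n{MARKER}\n{shared_block}"
        for repo, task in repo_tasks.items()
    }


def specs_missing_block(specs: dict[str, str], shared_block: str) -> list[str]:
    """Return the repos whose spec no longer contains the block verbatim."""
    return sorted(repo for repo, spec in specs.items() if shared_block not in spec)


specs = compile_specs(
    TOGGLE_SPEC,
    {
        "web": "Add the toggle check to the checkout page.",
        "api": "Add the toggle check to the tax endpoint.",
        "worker": "Add the toggle check to the invoice job.",
    },
)
```

The verbatim check is cheap to run at spawn time. If a spec was hand-edited after compilation, it shows up in `specs_missing_block` before any agent acts on a divergent version.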

The discipline question for this pattern is: for the material that crosses your orchestrator-to-sub-agent boundaries, is the location of the loading work explicit and matched to the case, and where the same material is consumed by multiple agents, is the duplication paid by the orchestrator at composition time?

Checkpoint vs chain: when the orchestrator pauses, when it does not

The third pattern is structural at a different layer. Once a pipeline has more than one step, the orchestrator has to decide for each step whether to chain into the next one automatically or to pause and wait for human approval. The wrong default in either direction has a cost. Chain too eagerly and the pipeline runs past the point where a human should have intervened, propagating a wrong answer through every subsequent step. Checkpoint too often and the pipeline becomes a series of approval prompts that the human stops reading carefully after the third one, producing approvals that are rubber-stamps in everything but name.

The cleanest articulation of the decision rule belongs to Birgitta Böckeler, whose risk-based review framework for AI-generated code names three axes: probability of error, impact if the error lands, and detectability of the error after the fact. The framework was developed for individual code-review decisions, not pipeline orchestration, but it maps cleanly onto the checkpoint-vs-chain question. Chain through cheap recoverable steps where probability is low, impact is small, or the next step would catch a problem anyway. Checkpoint before expensive or irreversible steps where catching after the fact is too late.

Chain when the next step would refuse to proceed on a wrong input. Checkpoint when the next step would act on the wrong input.

The recoverable test is the one to lead with. If a sub-agent returns a result and the next sub-agent in the chain would obviously refuse to proceed when the result is wrong, the chain is safe. The next agent is the de facto checkpoint. A research sub-agent that returns a list of candidate files, followed by a downstream sub-agent that reads each file and reports when it can’t find one, is a chain that detects its own errors at the next link. The chain is the right call.

The irreversible test is the one to checkpoint on. If a sub-agent’s result will trigger a database migration, a deploy, a billing event, a customer-visible change, or any action whose undo is harder than the original, the chain stops at that boundary. A human looks at the in-flight state, decides whether to proceed, and only then does the orchestrator continue. A factual contradiction between two reviewers is the same shape: continuing past it propagates one of the wrong facts through every subsequent step. A checkpoint there pauses for the one decision the system can’t autonomously make.

The third axis is detectability, and it’s the one teams underestimate. An action whose error would be caught on the next page load is detectable. An action whose error would surface as a slow degradation in some downstream metric over the next two weeks is not. Low-detectability errors are checkpoint candidates even when their impact looks moderate, because the asymmetry between cost-to-pause and cost-to-discover-much-later is enormous.
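The three axes and the two tests can be folded into one decision function, sketched below. The exact cutoffs (for instance, treating moderate impact with low detectability as a checkpoint) are one possible encoding of the prose above, not part of Böckeler's framework, and every name is an illustrative assumption.

```python
from dataclasses import dataclass
from enum import Enum


class Level(Enum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3


@dataclass(frozen=True)
class StepRisk:
    error_probability: Level
    impact: Level
    detectability: Level  # HIGH means an error would be noticed quickly
    irreversible: bool = False
    next_step_rejects_bad_input: bool = False


def decide(risk: StepRisk) -> str:
    """Return 'checkpoint' or 'chain' for the boundary before a step."""
    if risk.irreversible:
        # Irreversible test: the undo is harder than the original action.
        return "checkpoint"
    if risk.next_step_rejects_bad_input:
        # Recoverable test: the next link is the de facto checkpoint.
        return "chain"
    if risk.detectability is Level.LOW and risk.impact is not Level.LOW:
        # Slow, silent failures justify a pause even at moderate impact.
        return "checkpoint"
    if risk.impact is Level.HIGH and risk.error_probability is not Level.LOW:
        return "checkpoint"
    return "chain"
```

The inputs are declared per step ahead of time by whoever designs the pipeline. Nothing in the function inspects a sub-agent's return, so the outcome is fixed before the run starts.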

The same asymmetry shows up at the spawn boundary. Spawning agents that will do expensive work across multiple repositories is harder to reverse than stopping before the spawn. The discipline that has held is to checkpoint before the spawn (present the full team plan, name the agents, name the dependencies, request explicit confirmation), then chain through the actual work without further pauses. The checkpoint sits on the irreversible-or-expensive side of the boundary. The chain runs on the recoverable side.

The trap to avoid is checkpoint inflation: defaulting to a checkpoint at every boundary because checkpointing feels safe. A pipeline with a checkpoint at every step has imported the speed of its slowest reviewer into every chain. The throughput collapses, and the human is no longer functioning as a checkpoint. The structure is theater. Will Larson sharpens the corollary: “LLMs themselves absolutely cannot be trusted. Anytime you rely on an LLM to enforce something important, you will fail.” The checkpoints belong in code. The orchestrator’s pause-or-chain decision is one of those. The pause is mechanical, declared in the orchestrator’s own logic, not improvised by an agent at runtime. An orchestrator that decides whether to pause based on the content of the last sub-agent’s return is relying on the LLM to enforce something important. An orchestrator that checkpoints mechanically before any step touching production state is enforcing it in code.
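A minimal sketch of a pause that lives in code rather than in a model's judgment. The `touches_production` flag is set when the pipeline is declared, and the `approve` callback stands in for whatever human-approval mechanism a team uses; both names are hypothetical. The pause condition never reads the content of the previous return.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Step:
    name: str
    run: Callable[[str], str]
    touches_production: bool = False  # declared up front, not inferred at runtime


def run_pipeline(
    steps: list[Step],
    initial_input: str,
    approve: Callable[[str, str], bool],
) -> tuple[str, list[str]]:
    """Chain through steps, pausing mechanically before production-touching ones."""
    log: list[str] = []
    current = initial_input
    for step in steps:
        if step.touches_production:
            # The decision to pause depends only on the static flag. The human
            # sees the in-flight state, but the orchestrator does not ask a
            # model whether this return "looks risky".
            if not approve(step.name, current):
                log.append(f"halted before {step.name}")
                return current, log
            log.append(f"approved {step.name}")
        current = step.run(current)
        log.append(f"ran {step.name}")
    return current, log
```

With a two-step pipeline of `research` followed by a production-touching `deploy`, a rejecting `approve` yields the log `["ran research", "halted before deploy"]` and the deploy function is never called.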

The discipline question for checkpoint vs chain is: for each step in your pipeline, is the next step robust to a wrong input, or would it consume a bad input as if it were good?

What this is not

It’s not a claim that mid-task agent-to-agent communication is always wrong. There are problem shapes where the decomposition is genuinely unknown until partway through, where the cost of letting agents coordinate dynamically is less than the cost of trying to specify everything up front. Exploratory research, novel-domain investigation, and certain kinds of debugging fit this shape. The argument here is about production pipelines for known-shape work, where the sub-tasks are bounded enough that specifying them is cheaper than coordinating them.

It’s not a claim that isolation removes the need for the patterns. A pipeline of isolated agents that doesn’t specify boundary contracts, doesn’t pre-load shared material, doesn’t distinguish chain from checkpoint, and doesn’t control context duplication is a worse pipeline than a conversational team would have been. Isolation is necessary but not sufficient. The patterns are what make isolation produce reliable output.

It’s not a solved problem. The threshold for when a sub-task is “well-bounded enough to be specified up front” isn’t something I have a stable, transferable answer for. Calibrate too tight and you spec sub-tasks the agent could have figured out from the brief. Calibrate too loose and the agent fills the gap with substitution and continues. The signal I’ve learned to trust is whether the agent’s return would be different if I had given a more or less detailed brief. If the answer is no in either direction, the calibration is somewhere reasonable. I don’t have a stable answer for the verbosity-cap calibration either. The cap I land on changes with the agent’s role and the orchestrator’s downstream consumption pattern.

The hardest unsolved problem is contract violation recovery. When a sub-agent’s return has the wrong shape or wrong verbosity, the two moves are send-back-and-reformat (which re-embeds the bad return in the next context and propagates the pollution) or discard-and-respawn (which throws away real work and re-pays inference cost). Neither is satisfying, and the recovery question is genuinely open.

The other thing the patterns can’t do is keep themselves up to date. Every boundary contract, every pre-loading threshold, every checkpoint-versus-chain decision is a snapshot of what the team learned from the failures it has seen so far. The decisions get encoded. The work changes shape. The decisions stop fitting. Maintaining the discipline is its own ongoing work, and the cost of letting it drift isn’t a sudden failure. It’s a slow degradation that looks at first like the pipeline is just getting noisier.

Where do your multi-agent boundaries actually live?

Two halves of the same check, one for the run that just finished and one for the run that hasn’t started yet.

The retrospective half. Look at the most recent multi-agent run that produced an output you had to rework. Trace the work backward through the pipeline. At each step, ask three things about the boundary the work just crossed. Was the shape of what crossed this boundary specified, or was it whatever the agent picked? Was the verbosity calibrated, or did the agent return whatever length felt natural? Was the material the agent worked from loaded by the orchestrator and distributed, or was it loaded by the agent itself in parallel with other agents loading the same thing? If any of those three is missing, the failure was structural, not a model failure or an agent failure. The model did what it always does and produced the most plausible-looking answer it could from what was in front of it. Whether that answer was the right one was decided at the boundary, by whoever did or did not specify what would cross it.

The prospective half. Before the next run spawns, can the orchestrator state, in one sentence each, the shape and the verbosity cap of every sub-agent return it’s about to receive? Can it state which boundaries in the pipeline are chains (because the next step is the de facto checkpoint) and which are checkpoints (because the next step is irreversible or undetectable enough to require a human pause)? Can it state, for the brief it’s pre-loading, which lines every sub-agent provably needs and which are there in case they help? If those questions have answers the orchestrator can produce in advance, the pipeline is running on contracts. If those questions have answers only in retrospect, after the runs have already shown what each sub-agent returned, the pipeline is running on hope, and the difference between the two will compound across every pipeline run the team does this quarter.

The boundary contract isn’t a heavyweight artifact. It’s a sentence or two in a task spec that pins down what the orchestrator will receive and refuses to negotiate the shape and the cap. The freedom inside the contract is what makes the work worth delegating. The constraint at the contract’s edges is what keeps the delegation from polluting everything upstream of it. The discipline is whichever of those two halves the team finds harder. Most teams find the constraint half harder. The pull toward “let the sub-agent decide” is strong, because it feels respectful of the sub-agent’s autonomy and avoids the awkwardness of declaring shape upfront. The autonomy that matters is in the contents. The shape and the cap are the orchestrator’s job, and refusing that job is what produces the consultancy reports the orchestrator never wanted, in a shape it can’t consume, full of bookkeeping it can’t afford to carry.
