Tiered Planning for AI Agents: Validation Gates That Work

A spec arrives at an implementation agent. It says, in passing, that an upstream scheduling service returns a 503 status when a particular feature toggle is disabled. The agent treats that as ground truth. It writes the retry loop, the error branch, the logging path for the never-seen status code, the batch-abort logic that fires when the 503 propagates. The code compiles. The unit tests, which test the implementation against the spec’s assumption, pass. Code review passes; the reviewer has no more visibility into the upstream service’s actual contract than the agent did. The work ships into a PR.

The upstream service does not return 503 under that condition. It never has. The whole assumption was fabricated at the moment the spec was written, by a planning agent that had no way to verify the claim and felt no need to flag the gap. Four gates passed, all of them downstream of the same wrong input.

This is not an exotic failure mode. It is the normal failure mode of single-pass planning when an agent is the implementer. The fix is not “add another gate downstream” or “make the model smarter.” Both miss the shape of the problem. The fix is structural. It lives several steps further back than the place the failure surfaced, in a discipline I’ll call tiered planning: planning split into tiers, with a validation gate between each one.

The hidden checkpoint that single-pass planning relied on

Senior engineers who came up implementing from human-written specs share an intuition that hides a load-bearing assumption: when something in the spec does not make sense, the implementer will pause and ask. Sometimes that pause is a Slack message. Sometimes it is a comment on the ticket. Sometimes it is just a long stare at the screen followed by a walk to the coffee machine and a different question to a teammate. Whatever shape the pause takes, it acts as a hidden checkpoint: an unscheduled, undocumented validation event triggered by the implementer noticing the spec is incomplete.

The hidden checkpoint is what made flat, single-pass planning workable for so long. The spec did not have to be complete because the implementer would close the gap by asking. The PM did not have to think through every edge case because engineering would surface them. The architecture diagram could leave a service unspecified because the engineer building against it would notice and push back. The whole upstream cadence was looser than it looked, and the hidden checkpoint absorbed the slack.

Agents do not pause. The expectation that they will is the agent-pause assumption: the implicit human mental model in which pausing to ask is the response to something that does not make sense. Agents violate it by filling the gap with the most plausible-looking answer and continuing. When an agent is missing concrete information it needs, it picks the most defensible value, the most reasonable-looking interpretation, the most natural extension of what is in front of it. Then it builds. The output looks competent because the substitution looks reasonable in isolation. The failure shows up later, somewhere downstream, in a place that no longer obviously traces back to the original guess.

The service that supposedly returns 503 is one shape of this. Others: an agent inventing a database column because the spec mentioned a field name that does not exist; an agent picking the wrong architectural pattern because the spec did not say which one applied; an agent writing against an API endpoint with a route string that was plausible at first glance and wrong at runtime. None of these look wrong at the implementation site. The error was already baked in at the input. The implementation agent compiled it faithfully.

The hidden checkpoint is gone because there is no implementer at the keyboard to notice the gap. Whatever discipline used to live in “the engineer will catch it” now has to be installed upstream, deliberately, as part of planning itself. There is no other place for it to live.

Where spec-driven development holds

The popular response to this problem is to treat the spec as the planning artifact and to invest the engineering discipline in the implementation phase. This is the agentic-coding school in its mainstream form, and it is not a strawman. It works for a range of cases and has produced the most visible recent improvements in agent-assisted development. Drew Breunig’s Spec-Driven Development Triangle (March 2026) is the strongest contemporary expression: spec, tests, and code as three mutually-updating nodes, with implementation feedback maturing the spec over time.

The strongest version of the case goes like this. Planning is the spec. Once the spec exists in a form rigorous enough that the agent’s output can be verified against it, the leverage is in the implementation loop, not in adding more upstream ceremony. Invest in implementation-phase discipline: better prompts, more rigorous verification tests, automated review steps, structured deviation handling, sub-agent orchestration for the tricky parts. Let the spec be the planning artifact. The agent will implement against it, the conformance tests will catch divergences, and the spec itself will improve as implementation surfaces what the spec did not anticipate. Breunig’s framing is unambiguous: “A skill is a suggestion. A tool needs to be a checkpoint.” The checkpoint lives at the implementation boundary.

When the work is small and well-scoped, this is right. Emulation, porting, anything where a reference implementation supplies the conformance tests for free: the spec is bounded, the implementation agent has enough to verify against, and the verification loop closes cleanly. Upstream rigor beyond the spec itself would cost more than it earns back.

The rest of this post develops the case where the work is not small and well-scoped, where the spec contains assumptions the implementation agent cannot verify on its own, and where downstream verification cannot tell the difference between a faithful implementation of a wrong spec and a faithful implementation of a right one. Under those conditions, implementation-phase discipline cannot be enough. The failure mode it needs to catch was decided two layers up, before there was any code for the conformance tests to evaluate.

The compounding-failure mechanism

When an upstream assumption is wrong, what happens next has the same shape regardless of what the assumption was about. Call it the compounding-failure mechanism: one bad assumption made at the broad-shape tier silently shapes every downstream tier, so the failure surfaces several layers away from the original cause.

The walk is mechanical. The broad-shape tier accepts an assumption (this service returns 503; this feature requires a new column; this workflow has three states). The mid-level decomposition builds on it (we will need a story for the new column; we will need a migration; we will need a state machine). The narrow specifications are written against the decomposition (here is the schema, here is the migration script, here is the state-transition table). Implementation is written against the narrow specs. Tests are written against the implementation. Code review compares implementation to spec. Verification gates compare output to expected behavior.

At every stage, the work is competent. At every stage, the output is faithful to the input. At every stage, the failure is invisible because the input is itself the failure, and faithful-to-a-wrong-input produces a result that looks correct at every gate that does not happen to know the input is wrong.
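A minimal Python sketch of why every downstream gate passes. All names here are hypothetical; the only detail taken from the scenario above is the spec's claim that the upstream service returns 503. The test derives its expected value from the same constant the implementation was written against, so it cannot disagree with the implementation. The only check that can fail is one that compares the assumption with an observation from outside the spec.

```python
# Hypothetical names throughout; only the 503 claim comes from the scenario above.
SPEC_ASSUMED_STATUS = 503  # the spec's unverified claim about the upstream service


def handle_response(status: int) -> str:
    """Implementation written faithfully against the spec."""
    if status == SPEC_ASSUMED_STATUS:
        return "abort_batch"
    return "continue"


def spec_derived_test() -> bool:
    """Unit test whose expected value comes from the same spec as the code."""
    return handle_response(SPEC_ASSUMED_STATUS) == "abort_batch"


def contract_check(observed_status: int) -> bool:
    """Compare the assumption with what the upstream service was seen to return."""
    return observed_status == SPEC_ASSUMED_STATUS


if __name__ == "__main__":
    # Invented stand-in: the scenario only says the real status is not 503.
    observed = 200
    print("spec-derived test passes:", spec_derived_test())
    print("assumption matches observation:", contract_check(observed))
```

The spec-derived test passes no matter what the upstream service does, because nothing in it touches the upstream service. Only the contract check carries information about the input, and it has to run before the implementation is written to be worth anything.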

This is the planning-tier instance of inheritance failure: the broader pattern where an upstream assumption, output, or decision silently shapes everything downstream of it, with the cause and the symptom several layers apart. It is a member of the pollution family: a class of failure where a working surface (a context window, a multi-agent transcript, a feedback dataset, a planning artifact) accumulates noise faster than it accumulates signal, and the surface’s quality degrades not from running out of capacity but from filling with debris. In the planning-tier case, the debris is wrong assumptions inherited downstream without flagging.

The fabricated 503 is one instance. There is also the unverified database trigger that lets a SQL statement silently roll back, producing wrong behavior at runtime with no visible error. There is the stale code reference pointing at a method renamed three sprints ago, where the agent confidently writes code calling the old name. There is the route string committed to the spec at /v1/health when the actual pattern was /health/v1, because someone wrote it from memory rather than verifying. Each is a single upstream commitment, made before there was anything to verify against, that compiled into hours of downstream work and then surfaced only when a human happened to notice.

The asymmetry is the point. Catching one of these in planning costs on the order of thirty seconds of verification. Catching it after implementation costs the implementation hours, the test-writing hours, the review cycle, the rework, and the loss of confidence in the surrounding work that came along for the ride. Single-pass planning is a bet that the upstream commitments are right. Tiered planning is the discipline that refuses to take that bet.
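For the route-string case, the planning-time verification might look like the sketch below. It assumes the team can list the routes the service actually exposes; the `known_routes` set and both function names are hypothetical. Besides checking for an exact match, it reports known routes that use the same segments in a different order, which is the shape of the /v1/health versus /health/v1 mistake.

```python
def segments(route: str) -> list[str]:
    """Split '/v1/health' into ['v1', 'health']."""
    return [part for part in route.split("/") if part]


def verify_route(claimed: str, known_routes: set[str]) -> tuple[bool, list[str]]:
    """Return (exists, near_misses) for a route string claimed in a spec.

    A near miss is a known route with the same segments in a different order.
    """
    if claimed in known_routes:
        return True, []
    wanted = sorted(segments(claimed))
    near_misses = sorted(
        route for route in known_routes if sorted(segments(route)) == wanted
    )
    return False, near_misses


if __name__ == "__main__":
    known = {"/health/v1", "/jobs/v1"}  # hypothetical route table
    print(verify_route("/v1/health", known))
```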

Iteration at the planning tier

The senior engineer reading this has accepted, for years, the discipline of small batches at the implementation tier: trunk-based development, short-lived branches, frequent integration, automated tests run on every commit, red-green-refactor at the unit level. That stack of practices exists because a long-running implementation with a single integration event at the end is too risky. The cost of being wrong scales superlinearly with batch size; small batches bound it.

Tiered planning is the same discipline applied to the planning function itself: treat planning as wide-to-mid-to-narrow tiers with validation checkpoints between them, mirroring how iterative development is applied at the implementation tier. A single-pass planning batch that runs from initial requirements to fully-detailed implementation specs, with validation deferred to the implementation phase, is the planning-tier equivalent of a six-month feature branch with no integration tests until merge.

The reason the implementation-tier discipline became universal is not that engineers got better at planning. It is that the cost of being wrong at the end of a long batch was so visibly catastrophic that the practices to bound it became unavoidable. Breunig has described the current moment as one where “agentic engineering enables waterfall volume at the cadence of agile.” The volume side of the equation has scaled, and the integration risk that small-batch implementation discipline existed to bound has scaled with it. The same cost exists in planning. It is less visible because planning failures masquerade as implementation failures (or as model failures, or as “the agent lost the thread”), but the cost is real and it is being paid every time a downstream chain of work inherits an assumption nobody validated.

The tiered structure is a bet that running planning as small validated batches costs less than running it as a single batch and absorbing the inheritance failures downstream. In the work I have seen, this bet has been straightforwardly correct.

Three tiers, three kinds of work

The three tiers are the broad-shape tier, the mid-chunk tier, and the narrow-spec tier. Each produces a different kind of artifact. Each answers a different question. Conflating them is most of how single-pass planning produces the inheritance failures it produces.

The broad-shape tier answers what the work is, at the level of detail a senior engineer would use in a one-paragraph Slack message. It produces a shape-of-work artifact: which systems are touched, what the rough scope is, what is confirmed and what is conditional, what the open questions are. It does not commit to specifics. It deliberately leaves things vague where the specifics have not been earned. The output is breadth without depth, and the discipline of the tier is to resist depth before breadth has been validated.

The mid-chunk tier answers how the work decomposes into chunks that can each be planned independently. It takes the validated shape-of-work and produces a story-outline artifact: a list of chunks, each with a one-or-two-sentence description, each with a rough indicator of which system or component it touches. It does not write specs. It does not commit to implementation-level decisions. It is a structural decomposition, sized so the next tier can handle each chunk independently.

The narrow-spec tier answers, for one chunk at a time, what the implementable specification is. This is where detail commits: acceptance criteria, verification steps, API contracts, schema, route strings, exact behavior expected. By the time work reaches this tier, the broader assumptions have been validated, the decomposition has been validated, and the narrow tier can commit to specifics knowing the foundations are sound.
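One way to keep the three artifacts from blurring into each other is to give each its own type. The sketch below is an illustration, not a prescribed schema: the class and field names are assumptions drawn from the descriptions above. The useful property is what the broad-shape type lacks: it has nowhere to put a route string or a schema.

```python
from dataclasses import dataclass, field


@dataclass
class ShapeOfWork:
    """Broad-shape tier: breadth without depth."""
    systems_touched: list[str]
    rough_scope: str
    confirmed: list[str] = field(default_factory=list)
    conditional: list[str] = field(default_factory=list)
    open_questions: list[str] = field(default_factory=list)
    # Deliberately no fields for routes, schemas, or acceptance criteria.


@dataclass
class Chunk:
    """Mid-chunk tier: one independently plannable piece."""
    description: str  # one or two sentences
    component: str    # rough indicator of the system or component touched


@dataclass
class StoryOutline:
    """Mid-chunk tier artifact: a structural decomposition, not a spec."""
    chunks: list[Chunk]


@dataclass
class NarrowSpec:
    """Narrow-spec tier: where detail commits, one chunk at a time."""
    chunk: Chunk
    acceptance_criteria: list[str]
    verification_steps: list[str]
    api_contracts: list[str] = field(default_factory=list)
    schema: list[str] = field(default_factory=list)
    route_strings: list[str] = field(default_factory=list)
```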

The discipline that governs the boundaries between tiers is the rest of this post.

The validation gate

Between each pair of tiers sits a validation gate: a structured review of the assumptions about to be inherited by the next tier, not a status meeting. Its output is the inheritance set: the explicit list of which upstream assumptions are committed (the next tier may build on them) and which are open (the next tier must revisit them).

The gate does not approve work for being thorough or for looking like a planning artifact. It approves specific assumptions for being valid, and it marks the rest as still open. The work product is the inheritance set itself, written down, in a form the next tier can actually use.

Without an explicit inheritance set, the gate defaults to approving everything by silence. A planning doc gets shown around, nobody objects to anything specific, the next tier inherits the entire doc as if every claim had been validated. In practice, the only thing that got validated was the part the reviewer happened to focus on; everything else flows downstream as a tacit assumption. This is how the broad-shape tier ends up smuggling implementation-level commitments into the narrow tier, hidden inside what looked like high-level work.

The gate’s discipline is the inverse: nothing is committed by default. The reviewer marks, explicitly, which assumptions are now committed and which remain open. An open assumption is not a failure; it is a successful flag that downstream work needs to revisit it. A wrongly-committed assumption is the failure, because it lets the next tier build on a foundation that has not earned the trust.
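A minimal sketch of an inheritance set in Python, assuming a simple in-memory structure. The names are hypothetical, and the requirement that a commit carry recorded evidence is an added assumption rather than something the text above mandates. The property it demonstrates is the default: every assumption starts open, and only an explicit act by the reviewer moves it to committed.

```python
from dataclasses import dataclass, field
from enum import Enum


class Status(Enum):
    OPEN = "open"
    COMMITTED = "committed"


@dataclass
class Assumption:
    claim: str
    status: Status = Status.OPEN  # nothing is committed by default
    evidence: str = ""


@dataclass
class InheritanceSet:
    assumptions: list[Assumption] = field(default_factory=list)

    def add(self, claim: str) -> Assumption:
        """Record an assumption; it stays open until explicitly committed."""
        assumption = Assumption(claim)
        self.assumptions.append(assumption)
        return assumption

    def commit(self, claim: str, evidence: str) -> None:
        """Commit one assumption. Requiring evidence is this sketch's own rule."""
        if not evidence.strip():
            raise ValueError("refusing to commit without recorded evidence")
        for assumption in self.assumptions:
            if assumption.claim == claim:
                assumption.status = Status.COMMITTED
                assumption.evidence = evidence
                return
        raise KeyError(claim)

    def committed(self) -> list[str]:
        """Claims the next tier may build on."""
        return [a.claim for a in self.assumptions if a.status is Status.COMMITTED]

    def still_open(self) -> list[str]:
        """Claims the next tier must revisit."""
        return [a.claim for a in self.assumptions if a.status is Status.OPEN]


if __name__ == "__main__":
    gate = InheritanceSet()
    gate.add("work touches the scheduling service")
    gate.add("upstream returns 503 when the toggle is disabled")
    gate.commit("work touches the scheduling service", "confirmed with service owner")
    print("committed:", gate.committed())
    print("open:", gate.still_open())
```

In this usage the 503 claim stays on the open list, so the next tier receives it as a question to revisit, not as a fact to build on.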

A concrete shape. An open question at the broad-shape tier asks whether a new database column is needed or whether an existing field already carries the information. The single-pass version of planning would commit to “yes, new column” and write a migration story and a schema change as part of the broad-shape doc. The validation gate’s discipline is to refuse the commitment until the question is answered. The reviewer checks the existing field, confirms it carries the information, and the gate’s output flips that question from open to “no migration needed; existing field reused.” The mid-chunk decomposition is now built on the validated answer, and the migration story that would have been written never gets written.

Catching that question at the gate is one verification step. Catching it after the migration story is written, the schema change is committed, the implementation hours spent, and the PR in review is some multiple of that. The asymmetry is what makes the gate worth the time it takes.

The gate does not eliminate inheritance failures. It bounds them. Some committed assumptions will turn out to be wrong, and that work will compound downstream until the wrongness surfaces. The gate’s job is not perfection; it is to catch cases where the wrongness was knowable at the tier where it was committed, and to refuse to commit until the knowable wrongness has been investigated. Inheritance failures that survive a disciplined gate are usually genuinely unknowable at the tier they were committed in. Those are tolerable. The avoidable ones are what the gate exists to prevent.

Why the broad-shape tier must stay underspecified

The hardest discipline in tiered planning is not adding a tier. It is not running the validation gate. It is resisting the temptation to commit to specifics in the broad-shape tier, before the broad shape has been validated.

Every senior engineer who has done implementation work knows what implementable specifications look like, and is good at writing them. When asked for a broad shape, the trained instinct is to start writing what looks like a spec: route strings, schema definitions, exact behaviors, complete acceptance criteria. The work feels productive. The artifact looks substantial. The reviewer finds it impressive.

The artifact is also wrong, in a way that takes time to see. By committing to specifics before the foundations are validated, the broad-shape tier has smuggled in narrow-tier work without earning the right to make those commitments. The gate is now in a worse position: the reviewer is asked to validate route strings and schemas alongside the actual broad-shape questions, and the route strings absorb the attention because they are concrete and easy to react to. The actual broad-shape questions (is this the right scope; are these the right systems; is this assumption about the upstream service even verifiable) get less attention because they are vague and require thought to engage with.

The result is a gate that committed to specifics it had no business committing to, and left the actual foundational questions under-examined. The downstream tiers inherit the over-committed specifics and discover, eventually, that the route string was wrong, or the schema was wrong, or the acceptance criterion was wrong. Worse, they may not discover it at all, because the agents implementing against the specifics have no way to tell the specifics were guessed rather than derived.

The discipline at the broad-shape tier is to produce breadth without depth. Describe the work at the level the senior engineer’s one-paragraph Slack message would describe it. List the systems. Sketch the scope. Flag the open questions. Resist the urge to write a route string. Resist the urge to write a schema. Those belong to the narrow-spec tier, where the foundations have been earned.
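A rough, mechanical aid for that discipline is a check that flags narrow-tier detail in a broad-shape draft. The sketch below is a heuristic under loud assumptions: the two patterns are hypothetical and crude, they will miss things and raise false alarms, and they are no substitute for the reviewer's judgment. They only make a stray route string or table definition harder to overlook.

```python
import re

# Hypothetical heuristics; a team would tune these to its own artifacts.
NARROW_TIER_PATTERNS = {
    # Two or more path segments, e.g. /v1/health; skips "and/or" and URLs' "//".
    "route string": re.compile(r"(?<![\w/.])/[\w{}-]+(?:/[\w{}-]+)+"),
    # SQL table definitions or alterations.
    "schema definition": re.compile(r"\b(?:CREATE|ALTER)\s+TABLE\b", re.IGNORECASE),
}


def premature_specifics(text: str) -> list[tuple[str, str]]:
    """Return (kind, matched text) for each narrow-tier detail found in a draft."""
    findings = []
    for kind, pattern in NARROW_TIER_PATTERNS.items():
        for match in pattern.finditer(text):
            findings.append((kind, match.group(0)))
    return findings


if __name__ == "__main__":
    draft = (
        "Touches the scheduler and the batch runner. "
        "Health check is at /v1/health. "
        "Open question: does the toggle change the upstream response?"
    )
    for kind, matched in premature_specifics(draft):
        print(f"{kind}: {matched} (mark open and defer to the narrow-spec tier)")
```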

Two failure variants are recognizable enough to name. The first is in-session drift: someone in the broad-shape session asks a narrow question (are these columns in alphabetical order, or do they follow the consumer’s order?) and the discussion goes there. The session, which was supposed to produce a shape, has now produced a column-ordering decision. The decision was made without the discovery the narrow-spec tier exists to do. The narrow-spec tier inherits the decision as a fact because it was decided in the broad-shape pass and nobody flagged it as a placeholder. When the narrow-spec tier later finds evidence the decision was wrong, the discussion has to fight its way back upstream against an already-committed call. The discipline is to say: that question belongs to the narrow-spec tier. We will mark it open at this tier and answer it then.

The second variant is structural. The team holds the tiers apart but lets the same person write two adjacent tiers in the same sitting, against the same notes. The validation gate between the broad-shape pass and the mid-chunk pass is nominally there. In practice, the person who wrote the broad-shape note is the same person who wrote the mid-chunk decomposition, and they did it in the same hour, and the only reviewer the mid-chunk decomposition received was the writer themselves. The blind spots of the broad-shape pass carry directly into the mid-chunk pass because nothing in the process forced them to surface. The structure is theater. The mechanism is not running.

Both variants produce all the artifacts. Both pass through gates that look like gates. Both ship inheritance sets that match what single-pass planning would have shipped, because the discipline that made the structure mean something never engaged.

This is harder than it sounds. The broad-shape tier feels insubstantial when done correctly, and the artifact looks unfinished compared to a single-pass planning doc that contained all the specifics. The discomfort is the discipline. A broad-shape artifact that looks finished has almost certainly committed to specifics it should not have committed to.

What this looks like when it is running

When the discipline holds, the broad-shape tier produces a short, deliberately incomplete artifact. Senior engineers and product partners read it, work the gate, and write down a small inheritance set: this scope is committed, this decomposition axis is committed, these three questions are explicitly open and the mid-chunk tier will resolve them. The mid-chunk tier takes that inheritance set, does its decomposition work, surfaces the open questions inside each chunk, and writes its own artifact. Another gate. Another inheritance set. The narrow-spec tier writes detailed specs against committed mid-chunk decompositions and known open questions, surfaces what its discovery exposes, and feeds back any inheritance-set revisions the prior tiers need to absorb.

The work is slower at the planning tier than single-pass planning. Three artifacts get produced instead of one. Two gates run instead of zero. Discovery happens at three points instead of being deferred to implementation. The slowness is the entire payoff: every assumption caught at the planning tier is an assumption the agents do not silently inherit. The planning function gets larger; the implementation function gets smaller and more reliable. The total system gets faster because the work that does not have to be unwound is more valuable than the work that does.

The slowness is also where teams give up on tiered planning, usually before they have run a full cycle of the discipline. The first time a broad-shape session takes longer because someone insisted on leaving a question open instead of answering it on the spot, it feels like overhead. The first time a mid-chunk gate surfaces a decomposition issue and sends the broad-shape work back for revision, it feels like waste. It is not waste. It is the cost of catching the problem at the cheapest tier it could have been caught at, instead of the most expensive tier the team would have caught it at by default.

The teams that get the compounding payoff are the ones that hold the discipline through the early phase, when the gates feel like overhead, and into the period where the gates start catching things that would have shipped otherwise. The teams that bail out before the gates start catching anything end up with single-pass planning plus extra meetings, which is worse than single-pass planning, and they conclude tiered planning does not work. What did not work was running the structure without the discipline the structure was designed to scaffold.

What tiered planning is not

A few clarifications, because the tiered structure is easy to misread.

It is not a heavyweight process. The three tiers do not require three meetings, three documents, three review cycles, or three weeks. The broad-shape tier might take an hour. The mid-chunk tier another hour. The narrow-spec tier varies with the chunk. Total time, for work of meaningful scope, is typically less than the time that gets burned recovering from a single inheritance failure. The structure is heavy where the alternative is heavier.

It is not waterfall. The tiers are not phase gates with sign-offs and PMO involvement. They are batches of planning work with explicit validation between batches, the same way iterative implementation has explicit validation between batches. The tiers are sequential the way implementation iterations are sequential: each one builds on the validated output of the previous.

It is not a defense against everything. An assumption that nobody at the gate had reason to question (because it sounded right and looked verifiable) will pass through. A reviewer who skims the inheritance set rather than engaging with it will commit to assumptions they did not actually validate. A team that runs the gates mechanically rather than substantively will produce all the artifacts and absorb all the inheritance failures anyway. The discipline is necessary but not sufficient. It is the floor, not the ceiling.

It is not a solved problem. The threshold at which a tier is “complete enough” to gate is not something I have a stable, transferable answer for. I have moved it several times. Calibrate too tight and the gates become bureaucratic and the team starts gaming them. Calibrate too loose and the inheritance failures leak through. The signal I have learned to trust: a gate that has not failed in a long time is probably too lenient, and a gate that fails on every pass is probably too strict. Both are signals to recalibrate. I do not have a stable answer either for who runs the gate or in what medium. Some teams run it as a small group review. Some run it as a one-on-one read between writer and designated reviewer. The shape that has worked best in the work I have done is asynchronous with a written response, because the act of writing the response surfaces assumptions a verbal review tends to skip. That is preference, not principle. The principle is that someone other than the writer reads the artifact and writes down the inheritance set; the medium is negotiable. The right calibration depends on the kinds of work the team does, the failure shapes the team has already seen, and the team’s appetite for upstream rigor relative to downstream firefighting. I am skeptical of anyone who claims a clean answer here. The honest answer is that the threshold gets tuned by hitting failures and adjusting.
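The recalibration signal can be written down as a small check over a gate's recent history. This is a sketch under assumptions: the function name, the boolean encoding of outcomes, and the `min_history` window are all hypothetical, and what counts as "a long time" is exactly the part that has no transferable answer.

```python
def calibration_signal(outcomes: list[bool], min_history: int = 10) -> str:
    """Read a gate's recent history for a recalibration signal.

    outcomes: one entry per gate run, True if the artifact passed,
    False if the gate sent work back. min_history is a hypothetical
    stand-in for "a long time"; tune it to the team's cadence.
    """
    if len(outcomes) < min_history:
        return "insufficient history"
    failures = outcomes.count(False)
    if failures == 0:
        return "possibly too lenient"
    if failures == len(outcomes):
        return "possibly too strict"
    return "no signal"


if __name__ == "__main__":
    print(calibration_signal([True] * 12))
    print(calibration_signal([False] * 12))
    print(calibration_signal([True, False] * 6))
```

Either extreme is a prompt to look at the gate, not a verdict on it.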

Is your tiered planning gate actually running?

Look at the last implementation that failed in a way that surprised someone. Trace the failure backward. Find the layer at which the wrong assumption was first made. Was there a checkpoint at that layer that could have caught it? If it existed, was it a structured review that produced an explicit list of validated assumptions? Or was it a status meeting where the planning doc got shown around and nobody objected?

If the checkpoint did not exist, that is the gate the team is missing. If it existed but did not produce an explicit inheritance set, that is the discipline that needs sharpening. If it produced an inheritance set and the wrong assumption was on it as committed, the gate was running, and the remaining question is whether the wrongness was knowable at the layer where it was committed. If it was not, the failure was genuinely unforeseeable there, and that is the case the discipline does not promise to catch.
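The retrospective check is a short decision tree, sketched here in Python with hypothetical names. The first three outcomes follow the cases described above. The fourth, where the inheritance set existed and the assumption was listed as open, is not named in the text, so the sketch labels it as not covered instead of guessing at a diagnosis.

```python
from enum import Enum


class Finding(Enum):
    MISSING_GATE = "no checkpoint existed at the layer where the assumption was made"
    NO_INHERITANCE_SET = "a checkpoint existed but produced no explicit inheritance set"
    COMMITTED_BUT_WRONG = "the gate ran and committed an assumption that proved wrong"
    NOT_COVERED = "the assumption was listed as open; this case is not named in the text"


def triage(
    checkpoint_existed: bool,
    inheritance_set_written: bool,
    assumption_committed: bool,
) -> Finding:
    """Classify a surprising failure after tracing it back to its first wrong assumption."""
    if not checkpoint_existed:
        return Finding.MISSING_GATE
    if not inheritance_set_written:
        return Finding.NO_INHERITANCE_SET
    if assumption_committed:
        return Finding.COMMITTED_BUT_WRONG
    return Finding.NOT_COVERED


if __name__ == "__main__":
    # A status meeting happened, but nobody wrote down what was validated.
    print(triage(True, False, False).value)
```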

The prospective version of the same check, for the next planning decision rather than the last failed one: between the broad-shape decision about to be made and the first detailed spec that will be written under it, what is the validation gate? If the answer is “the same person will write both, in the same sitting, against the same notes,” there is no gate. The structure may be there. The mechanism is not running. The next agent that implements against that detailed spec will inherit every assumption the broad-shape decision made silently, and the failure that surfaces will be several tiers from the cause. If the answer is “the broad-shape decision will be reviewed by someone other than the writer, and the inheritance set will be written down before the mid-chunk work begins,” the gate is running. Whether it is calibrated correctly is a separate question. Whether it is running at all is the first one.
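The prospective check reduces to two conditions, sketched below with hypothetical names: someone other than the writer reviews the broad-shape decision, and the inheritance set is written down before the next tier begins. The sketch says nothing about calibration; it only answers whether the gate is running at all.

```python
from dataclasses import dataclass


@dataclass
class PlannedGate:
    writer: str
    reviewer: str
    inheritance_set_written_before_next_tier: bool


def gate_is_running(gate: PlannedGate) -> tuple[bool, list[str]]:
    """Return (running, problems) for the gate between two adjacent tiers."""
    problems = []
    if gate.reviewer.strip().lower() == gate.writer.strip().lower():
        problems.append("reviewer is the writer")
    if not gate.inheritance_set_written_before_next_tier:
        problems.append("no written inheritance set before the next tier begins")
    return (not problems, problems)


if __name__ == "__main__":
    same_sitting = PlannedGate("dana", "dana", False)  # hypothetical names
    print(gate_is_running(same_sitting))
```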

The check is not whether the team has tiered planning. It is whether the team has gates that bound inheritance failures, and whether those gates produce the explicit, written, usable output that lets downstream tiers know what they may safely build on. Without the explicit output, the gate is a meeting. With it, the gate is the mechanism that keeps single bad assumptions from compiling into hours of competent-looking, faithfully-implemented, structurally wrong work.

The hidden checkpoint that backstopped single-pass planning for human implementers is gone. The replacement is not a smarter agent or a bigger context window. It is a planning function that no longer hands its mistakes downstream to be discovered by the implementation that inherited them.
