Working With LLMs: Mental Models That Match the Mechanism

A developer opens a fresh session. They paste a four-paragraph role prompt at the top. Then the file tree. Then the README. Then a paragraph about the tech stack. Then, finally, the actual question, which is whether a particular method on a third-party SDK takes its arguments in the order they think it does.

The model could have answered the question by reading the SDK in three seconds. Instead it reads three thousand tokens of context that were not relevant to the question, and then reads the SDK anyway, and then answers. The developer concludes the session went well. They will do the same thing tomorrow.

This is not a prompt-engineering failure. It is a mental-model failure. The developer is operating on a folk model of what the model is, and the folk model produces the behavior. Better prompts do not fix it. A better mental model does.

The folk model is roughly: a small intelligent agent that ingests information, knows things, and remembers what it has been told. The corrected model is structurally different in every load-bearing detail. There is no memory between turns. There is no internal workspace where information is held separate from the bytes in the context. Every turn is a fresh forward pass over the entire context, generating the next plausible continuation. What feels like memory is re-reading. What feels like understanding is generation.

From that corrected mechanism, a partition falls out. The model is fast and accurate at a category of work that bottlenecks the human: verifying SDK signatures, recalling syntax, checking a contract against an implementation, breadth across an enormous library of technical knowledge. The human carries a category of context the model has no access to: the domain, the business rules, the judgment calls that would deviate from a default approach. Effective collaboration is mutual deference based on capability. The folk move is to treat “prompt engineering” as the skill. The actual skill is the negotiation.

What the Model Actually Is#

The folk model gets the mechanism wrong in four specific ways, and the wrongness drives the bad behavior.

There is no memory between invocations. The model does not “remember” your previous turn the way you remember your previous thought. It re-reads the entire conversation every forward pass. What feels like continuity is reconstruction.

There is no internal workspace. The window does not contain a workspace where the model holds a representation of your project. The window is the workspace. Every token is processed every pass through attention. There is nothing held in a separate place.

It is generating, not retrieving. The model does not look up an answer and report it. It generates a plausible continuation token by token. When the continuation happens to match reality, that is correctness. When it does not, that is hallucination. They are the same operation, and the only difference is whether the output is checkable against something true.

It has no privileged access to your codebase, your tools, your runtime, or your team’s prior decisions. Whatever it knows about your situation is in the context window or fetched through a tool call. There is no osmosis. The model cannot accrete shared context the way humans do by being in the room.

The penalties for cluttered context are worse for the LLM than for a human. A human with a cluttered desk has persistent state. They remember what they were doing, what they tried, why they made certain choices. An LLM has none of this. Every irrelevant token in the transcript is processed cold, every turn, forever.

These four facts are the foundation. Almost every “prompt engineering” technique that works does so because it accommodates them. Almost every technique that doesn’t work fails because it’s fighting them.

The Partition Falls Out of the Mechanism#

If the model has no memory between turns, no internal workspace, no retrieval step, and no implicit access to your situation, certain things follow about what work it should and shouldn’t do.

It should do work that benefits from breadth and speed over context: verifying that a method signature matches the docs, checking that an implementation conforms to a contract, recalling the syntax for a feature you use once a year, reading 200 lines of unfamiliar code and summarizing what they do, generating boilerplate against a known pattern. The model is faster and more accurate at this work than I am, and the gap is not small. An agent goes into an SDK repo and verifies a method signature in seconds, more accurately than I’d manage in ten minutes of grep-and-squint.

It should not do work that requires context it does not have and cannot fetch: whether the new feature should match the existing pattern in the codebase or deliberately break with it, whether a particular performance trade-off matters here, whether the business rule you are modeling has an exception you have not articulated. The model can guess, and on familiar territory the guess is often plausible, but plausibility is not correctness. On these calls, it should defer to me, because I know things it cannot.

The partition is the unit. Naming who leads on which work makes the work executable. When neither side names it, both sides hedge, and the work happens twice or not at all.

A worked case: an agent verifies method signatures, return types, and which interface a class actually implements. The human knows that the team agreed last quarter to deprecate the synchronous variant of that method even though the SDK still ships it, and that anything new should use the async form. Neither side could have produced both halves alone. Played correctly, the human asks the agent to verify the contract and then applies the policy. Played the folk-model way, the human types the contract from memory (slower, error-prone, mostly redundant) and then second-guesses the policy (the part where they actually add value).

Prompt Engineering Is Mostly the Human Refusing to Partition#

Most of what gets called prompt engineering is the human refusing to do this partition and trying to compensate with prompt incantations.

Role prompts (“you are an expert C# developer with 20 years of experience…”) try to make the model do the user’s job better. The technical knowledge was always present. What changes the answer is the question. A precise question gets a precise answer. A vague question gets a vague answer dressed up in expert-flavored prose. The role prompt is the user attempting to do their own job (specifying what they want) by addressing the model instead.

Upfront context dumping is the same refusal in a different shape. The developer pastes the file tree, the dependencies, the architecture notes, the recent commit log, before the question. The model does not need most of it and would fetch what it needs from disk anyway, on demand. The upfront paste pays a token cost on every subsequent turn for content that was not needed and dilutes attention: the relevant signal sits inside a thousand tokens of irrelevant scaffolding, and reasoning degrades while well within the context budget.

Both moves are the human keeping control of work that should have been handed off. The model is supposed to fetch. The model is supposed to verify. The user keeps doing those things on the model’s behalf because the user does not yet trust the partition. The trust is the actual skill. Once you have it, the prompts get shorter, the sessions get faster, and the output gets better at the same time.

The popular school, sensibly, has been moving in this direction. “Prompt engineering” as a phrase has been losing ground to “context engineering,” the discipline of curating what the model sees. The shift is the right one. Context engineering still keeps the human as the active party, which is correct: the negotiation lives on the human side. But the popular framing of context engineering still lets the model off the hook on a lot of the work it should be doing, which the partition framing fixes.

Illustration One: Just-in-Time Context#

The developer who pastes the file tree at session start has a folk theory: the model needs context to be useful, so loading more context up front gives it a better starting position. The mechanism makes the opposite prediction. Every token is processed every forward pass. Tokens not relevant to the current question dilute attention and push the relevant material into worse positions. Reasoning degrades while well within capacity. The cost is not running out of room. The cost is reasoning poorly with room to spare.

The cost also compounds. A 50k-token priming prompt costs 50k tokens of input on every turn for the rest of the conversation. Even with caching at 90% off, that’s 5k tokens of full-price input per turn for content the model didn’t need.

The corrected pattern matches how a developer actually works on a real codebase. A developer working a bug does not load the entire codebase into their head. They have a vague spatial sense of where things live and pull files into working memory as needed. The filesystem is the source of truth. Working memory is a small high-bandwidth scratch space.

Eugene Yan frames the same intuition by analogy: “Onboard each new session like a new hire.” The session is a competent newcomer who needs only what’s relevant to the immediate task, plus the ability to ask for more. What goes on the memo? Exactly what they need to answer this question well. Not everything that might be useful eventually. Not the org chart. Not a tour of the office.

The model wants the same shape, for the same structural reason: limited high-bandwidth attention over the current task, large slow-but-accurate external store, fetched on demand.

The headline: pre-load conventions, fetch facts. Conventions shape how the model reasons (your team’s terminology, the patterns you want followed, the gotchas not in any file). Facts are what it reasons over (the actual code, the actual data, the actual interfaces). Conventions earn permanent context. Facts do not. Confusing the two produces the 20KB system prompt that no one would write from scratch but no one is willing to cut.

Illustration Two: Role Prompts Are Mostly Inert#

“You are an expert C# developer.” The folk theory is that this incantation activates a domain expert mode. The mechanism makes the same opposite prediction it made for upfront context. The technical knowledge is always present in the model’s weights. There’s no holding-back-waiting-for-the-right-prefix step.

What feels like activated expertise on a role-prompted session is usually the model’s audience-calibration heuristic responding to the question, not the role prefix. The audience-calibration heuristic is the model’s default behavior of calibrating response register to match the apparent expertise of the questioner. A precise technical question phrased the way a senior engineer would phrase it gets a senior-engineer-flavored answer with or without the role prompt. A vague question phrased like a beginner gets a beginner-flavored answer regardless of the role prompt. The variable doing the work is the question. The role prompt is overhead.

This had a real effect on early models, which is where the folk wisdom comes from. GPT-3.5-era models defaulted to consumer-friendly hedge-and-explain mode unless prompted otherwise. “You are an expert” suppressed the hedging and produced more direct technical output, which felt like activating expertise but was really just suppressing a default. Contemporary frontier models calibrate to context. Ask a technical question in technical language and you get a technical answer. The role prefix is doing redundant work the question itself already accomplished. (This is a version-specific historical observation, not a permanent claim about all future models.)

The narrow legitimate use is honest task specification: “respond at the level of a senior X” as a way of telling the model what level of detail and assumed knowledge to operate at. That is not a role prompt. That is the user specifying their position in the partition. It is the move the user should be making anyway.

Other Threads, in Passing#

The partition framing has several developed threads. They get acknowledged here so the manifesto holds the territory together.

Defaults matter more than capabilities. Models can do many things they don’t do by default. Pulling the right behavior out of them often isn’t about discovering hidden capabilities. It’s about converting a possible behavior into a default for the current task. “You can verify against the SDK” is a capability. “Verify against the SDK whenever you make claims about its behavior” is a default. Most users operate at the capability layer and never at the defaults layer. The leverage lives at the defaults layer.

The user’s position, not the model’s role. Role prompts specify the model’s claimed identity but say nothing about the user’s actual position. The interaction model that gets activated is “two experts talking,” which is wrong when the user is a non-expert seeking guidance. What the user needs is to specify their own position: what they know with confidence, what they have intuitions about, where they’re out of their depth. That produces a far more useful interaction than any amount of role-prompting on the model’s side.

Three kinds of partition. The partition shows up in three patterns. Capability partition: one side is genuinely better-equipped for a category of work, the model with training in statistical methods you don’t, you with institutional knowledge it doesn’t. Comparative-advantage partition: both sides can do something, but one is dramatically faster, like the SDK verification case above. Confidence partition: both sides can do something, but one is more reliable. None of these are obvious until you start naming them. Naming them is what makes them executable.

Adversarial-mode developers. A subset of developers operate in adversarial mode toward the model: scanning for failures, refusing to delegate, moving the goalposts on what counts as the “real” skill as the model’s capabilities expand. The pattern is identity-defense, not capability assessment. The diagnostic is straightforward: ask a developer about a recent task where the model did something well. Developers in collaborative mode have stories ready. Their absence is the marker.

The Amplifier Shape#

A managerial folk theory says AI tools raise a floor: weak developer plus AI equals competent developer. The mechanism is the opposite. The tools amplify whatever the user brings. A multiplier of zero is zero. They don’t make weak developers worse. They expose the actual level by removing typing speed as camouflage for skill.

Will Larson develops the same constraint hierarchy in different vocabulary: time is solved, attention is being solved, judgment remains the binding limit. When time and attention come down, the variable that determines outcome is judgment, and judgment was the variable all along. The new tools just made it visible.

The same shape shows up at the team level. AI tools do not create undisciplined organizations. They expose and accelerate the ones that already existed. The floor is whatever the team brought to the work. The new ceiling is higher. The distance between them stretched, not shrank.

The Deepest Skill in Working With LLMs Is Accurate Self-Assessment#

Effective collaboration with a capable model is mutual deference based on honest capability assessment. The list of things the model is better at is long and varied. The list of things the human is better at is short and concrete: domain context, business rules, judgment calls that would result in meaningful change or deviation from a default. That is roughly the whole list.

The discipline is not knowing the categories. It is actually using them. It is noticing, mid-session, when you are answering a question the model could answer faster and more accurately, and handing it back. It is noticing when the model is making a judgment call you should be making, and stepping in. It is noticing that the question you keep wanting to ask the model is actually a question only you can answer.

Most “prompt engineering” is rituals that correlate with users putting effort into their prompts. The effort is the actual variable. If you can write a clear specification and partition the work honestly, you don’t need the tricks. If you can’t, the tricks help marginally, but the real fix is learning to write clear specifications. The advice industry sells incantations because incantations fit in a tweet. The negotiation does not.

The check you can run on your own current workflow tonight: pick the last three sessions. For each one, identify one moment where you did work the model should have done, and one moment where the model did work you should have done. If you cannot find both in every session, you are either being dishonest with yourself, or you have already internalized this and the rest of the series will not surprise you. Most of us, on most days, can find both without trying hard. The partition is not subtle. It is just easy to ignore.

Working With LLMs Effectively: Mental Models That Actually Match the Mechanism