Attention Is Not Memory: How LLMs Handle Conversation

A developer is forty turns into a session. They told the model on turn fifteen that the project uses snake_case for database columns and camelCase for API fields, and that the boundary between the two is the serialization layer. On turn forty the model produces a SQL query with camelCase columns. The developer types, “I told you we use snake_case for columns,” and the model apologizes and rewrites it.

Most readers will recognize this. The instinct that follows is to say the model “forgot.” Some readers will go further and say the model has bad memory, or that the conversation got too long for it to track. Both reactions come from the same intuition: the model is a small intelligent agent that remembers what it has been told, and longer conversations strain the remembering.

The model has no memory at all between turns. It is not remembering badly. It is not remembering at all. Each turn, the entire prior transcript is shipped back to the model, and the model produces the next response by attending over that transcript end to end. What feels like memory is re-reading. What feels like forgetting is the same act of re-reading, done over a transcript where the relevant signal got buried.

Each turn is a discrete invocation#

There is no persistent process on the other side of the conversation that is holding state for the developer between turns. There is no “the model” that gets handed turn forty and remembers turn fifteen. There is a stateless function that, when called, takes a transcript as input and produces the next message as output. The provider’s product wraps that function in a chat UI that makes it feel like a continuous conversation, but the wrapping is a UI affordance. The mechanism underneath is request-response.

When turn forty arrives, the API call carries every prior turn in the transcript. The model attends over the full sequence in a single forward pass. It doesn’t have access to a separate memory of what was discussed earlier. It doesn’t have a compressed summary of the established conventions. It has the bytes in front of it, and it has its training. That’s everything.

This is one of those facts that sounds technical and small until you sit with what it implies. Every appearance of continuity across turns is the developer’s experience, not the model’s. The model has no experience of turn fifteen when it is producing turn forty. It re-encounters turn fifteen for the first time, every turn, forever.

”It forgot” is almost always “the signal got buried”#

The intuition says the model is forgetting older context as the conversation grows. The mechanism says the older context is still right there in the transcript, processed in full on every turn, and the failure is that attention isn’t finding it.

Two specific mechanisms drive the apparent forgetting. The first is attention dilution: the relevant signal for the current turn is competing for attention with everything else in the transcript, and the longer the transcript, the higher the noise floor. The convention established on turn fifteen is two sentences in a forty-thousand-token transcript. The model’s attention has to suppress thirty-nine-thousand-and-something tokens of more recent material to weight those two sentences correctly. Sometimes it does. Often it does not.

The second is lost-in-the-middle. Models attend more strongly to the beginning and end of the context than to the middle. A convention established on turn fifteen lives in the worst attention position by turn forty: too far from the front to ride the opening, too far from the end to ride the recent. The convention is exactly where attention is weakest. The newer messages, including the one that should have triggered recall, are at the end and pull attention to themselves. The result feels like the model forgot something old. The mechanism is that it found something newer to focus on instead.

When the model contradicts something established earlier, two things might be true. The bytes might no longer be there: the session was continued from a stale summary, the transcript got compressed, a tool call clobbered the relevant content. Or the bytes are still there and attention didn’t find them. Both look identical from the user’s seat. They have completely different fixes.

The first case is corruption. Re-pasting the missing constraint actually restores state.

The second case is signal burial. The bytes are there, competing with twenty thousand other tokens that have accumulated since: file reads that didn’t lead anywhere, intermediate reasoning, build output, the model’s previous attempts. Re-pasting doesn’t restore anything because nothing was missing. It moves the relevant content to the end of the transcript where attention is strongest. The fix worked, but not for the reason the user thought.

Once you hold this distinction, the diagnostic move falls out. When the model misses something, don’t ask whether it remembered. Ask whether the signal would be findable on a re-read of this transcript. If a thoughtful human, given only the transcript and the current question, would have to scroll up and search for the convention, the model is in the same position. The model isn’t forgetting. The transcript is asking too much.

The persistence features are not the exception#

The most common reader objection here is the wave of memory-style features that vendors have shipped over the past two years. Conversation history. Project-scoped memory. The model that “remembers” your preferences across chats. Pinned context. Custom instructions. If the model has no memory between turns, what are these?

They are bytes in the context, packed in by the harness at request time.

The provider’s chat product, when memory is on, captures candidate facts during a conversation and stores them somewhere durable. On the next request, those facts are formatted and prepended to the system prompt, or interleaved into the message sequence, before the request is sent. From the model’s perspective on that next turn, nothing has changed. It is still a stateless function being called with a transcript. The transcript now happens to include a synthesized “things to remember about this user” block. The model attends over it the same way it attends over everything else. If the relevant fact is in the block, attention may find it. If the block is long and the current question is far from the relevant entry, attention may miss it. The familiar failure modes return.

Naming this isn’t a complaint about the features. They do real work, especially for the case where a user wants the same standing context applied across many sessions. The point is structural. Persistence features change what gets put in the workspace. They don’t give the model a workspace that survives between turns. The mechanism on the model side is unchanged.

The vendors themselves have been quietly drifting toward this framing in their own docs. Earlier marketing language for memory features leaned on “remembers” and “learns about you.” Newer documentation more often describes the mechanism: the system extracts facts, stores them, and includes them in future requests. The honest description is the mechanism description.

The right way to use memory features in practice is the same question one layer up: what to capture, what to leave out, when the accumulated memory file becomes the same kind of bloat problem a long CLAUDE.md becomes. The disciplines that govern session-level context probably apply unchanged. I haven’t seen this tested at the scale where it would matter.

The developer’s diagnostic move#

When the model misses something it was told earlier, the reflex is to repeat the instruction louder, often with a frustrated preface like, “as I said before.” This is the worst available move. It adds another layer of recent text to the transcript, which pulls attention further toward the recent end and away from wherever the original instruction lived. The instruction the developer wants the model to honor is now even harder to find.

The competent move is to scroll up, find the original instruction, and copy it into the current turn. Not “as I said before.” The literal text of what was said before, in the current message, where attention is strongest. This works because it puts the signal where attention is, not because the model is being more obedient the second time. The developer has done the work attention was supposed to do.

The same logic, generalized, is the discipline. Treat the transcript as a workspace the model re-reads from scratch every turn. When the relevant signal is buried, pull it forward. When the workspace is too cluttered to make pulling forward useful, start fresh and bring only what matters. The model isn’t failing to remember. The developer is failing to compose.

The continuation incident#

The cleanest version of corruption I’ve watched happen was a session that ran the context window out and had to be split. The first session ended with a captured summary intended to bootstrap the next one. The second session started with that summary and a directive to keep going. Within a few turns it began re-investigating questions the first session had already settled.

The second session never had the answers. The summary was wrong in a few places: the wrong commit was named as the working state, the list of remaining files was a turn out of date, and one resolved question was described in language that read as still-open rather than settled. The second session took the bytes it was given, reconstructed a model of where things stood, and the reconstruction had errors. The first session had real state. None of it survived. The second session built fresh from a lossy summary, and anywhere the compression had errors, the new session believed wrong things.

The mitigation that worked wasn’t more careful prompting on the second session. It was tightening the artifact: capturing concrete state (specific commit SHAs, exact remaining file lists, settled vs. open questions in unambiguous form) rather than narrative. Narrative was where the errors had crept in. Narrative was always going to be where they crept in, because narrative compresses lossily and the model can’t tell the difference between a faithful summary and a slightly drifted one.

The shape generalizes beyond this one incident. Whenever a session is continued, restarted, or handed off, the new session’s reality is whatever the artifact says. The artifact is the substrate. If the artifact lies, the new session believes it.

Sub-agents and fresh sessions are the structural answer#

If memory is re-reading, the lever is what is in the transcript when the re-read happens. Two patterns follow naturally, and both look like discipline more than tooling.

The first is the fresh session. When a session has accumulated forty turns of exploration, half of it dead-end, and the developer wants to push forward on a different angle, the right move is often to start a new session and bring forward only the result. The rationale is structural. The forty-turn transcript will be re-read every turn of whatever comes next. Most of it is noise relative to the new direction. The dead-end exploration is now actively in the way: it is competing for attention with whatever the next question needs, and it is closer to the recent end of the transcript, which is the worst place for old noise to sit. Starting fresh is not throwing away progress. It is composing a smaller workspace that the next turn can re-read efficiently.

The threshold for when to start fresh and when to keep going is one I’ve moved several times. Some sessions stay productive past a hundred turns. Others go sour at twenty. The variables don’t compose into a rule I can quote: how much exploration produced dead ends, how much of the early transcript is still load-bearing for the current direction. The heuristic I have: when I find myself repeating instructions that should already be in the transcript, the workspace has degraded and starting fresh costs less than continuing.

The second is the sub-agent. The popular framing of sub-agents emphasizes specialization, as if the value were a “database expert” or a “frontend expert” persona. The mechanism that actually pays is context isolation. A sub-agent runs with a clean context containing only what its scoped task needs, produces a result, and returns the result rather than the working context. The parent session re-reads the result on subsequent turns, not the working scratch that produced it. The point is not that the sub-agent knew a domain better. The point is that the parent’s transcript stayed small.

This convergence isn’t an accident or a fashion. Independently developed agentic workflow systems have arrived at the same answer for the same reason. One framework implements fresh context per task at the prompt level. Another implements it at the infrastructure level via a bash loop that spawns a clean instance for each iteration. A third frames it as the design-time discipline of writing self-contained task specs that don’t need conversation history to make sense. Different teams, different stacks, same structural response. The same shape appears in public practitioner writing too: the framing of each new session as a blank slate that requires explicit onboarding is the direct operational consequence of the discrete-invocation mechanism. When multiple independent designs converge on the same shape, the shape is usually telling you something about the constraints.

Both patterns are the same shape: control what gets re-read by controlling what’s in the workspace. Drift across a long session isn’t the model getting tired. Drift is the failure of letting the workspace fill with material the model is going to keep attending over forever.

The discipline question#

When the model next misses something you established earlier in the conversation, don’t type the correction. Open the transcript. Find the place where you said it. Read what’s between that place and the model’s current turn.

How much of what you find is actually load-bearing for what you are doing now? How much is dead-end exploration, false starts, instructions you have since revised, output from tools you only ran once? If a careful human reader, given only this transcript and the current question, would have to work to find the convention you established on turn fifteen, the model is in the same position.

The model isn’t failing to remember. You’re asking it to find a needle in a haystack you packed yourself. The fix isn’t a better model. The fix is a smaller haystack.

Attention Is Not Memory: Why Models Re-Read Everything Every Turn