AI Agent Memory Systems Explained: Short-Term, Long-Term, Vector Stores
Every capable AI agent has the same underlying problem: the model itself remembers nothing between sessions. Once a conversation ends, the context window is gone. The next time you open a chat, the agent starts from a blank slate. For one-off tasks that is fine. For any agent that needs to work with you over days, weeks, or across multiple projects, it is a serious limitation.
Memory is what separates a stateless question-answering tool from an agent that actually learns how you work, what your codebase looks like, and what you tried last Tuesday. Understanding how agent memory systems are built helps you choose the right architecture, debug agents that seem to forget things they should know, and evaluate frameworks that make memory claims in their marketing.
This guide covers all the moving parts: the types of memory agents can use, how they are implemented technically, and how modern frameworks wire them together.
Why the context window is not enough
The most obvious form of agent memory is the context window itself. Everything in the current conversation - the user messages, the assistant replies, tool call outputs, system instructions - lives in context. The model can reference any of it when generating its next response.
For short tasks this works perfectly. Ask an agent to refactor a function, and it holds the original code and your feedback in context until the job is done. But context windows have hard limits. Current frontier models typically cap out somewhere between 128,000 and 1,000,000 tokens. That sounds large until you factor in a large codebase, a week of conversation history, meeting notes, and reference documents all competing for the same space.
There is also cost. Unless the provider caches the prompt prefix, every token in the context window is processed again on every inference call. An agent that naively dumps its entire history into every prompt becomes slow and costly as sessions grow.
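To see why naive history replay gets expensive, a back-of-the-envelope sketch helps. It assumes every turn adds the same number of tokens and that nothing is cached, both of which are simplifications, and the function names are purely illustrative.

```python
def tokens_processed_full_history(turns: int, tokens_per_turn: int) -> int:
    """Total input tokens when each call resends all turns so far."""
    # Call number k carries k turns of history, so the total grows quadratically.
    return sum(k * tokens_per_turn for k in range(1, turns + 1))


def tokens_processed_windowed(turns: int, tokens_per_turn: int, window: int) -> int:
    """Total input tokens when each call carries at most `window` turns."""
    return sum(min(k, window) * tokens_per_turn for k in range(1, turns + 1))


if __name__ == "__main__":
    print(tokens_processed_full_history(100, 500))  # 2525000
    print(tokens_processed_windowed(100, 500, 10))  # 477500
```

Under these assumptions, a 100-turn session at 500 tokens per turn processes about 2.5 million input tokens when the full history is resent, against under 0.5 million when each call carries only the last 10 turns.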
The result is that production agents need memory systems that exist outside the context window - persistent stores that can be queried selectively, loading only what is relevant for the current step.
Short-term memory: working memory during a task
Short-term memory in agents refers to the information that lives in context during a single task or session. It includes everything the agent has seen and done since the current session began.
This is the scratchpad. The agent can use it to track where it is in a multi-step plan, remember what tools it has called, and keep intermediate results available without re-fetching them. A coding agent working through a bug fix uses short-term memory to hold the error message, the relevant code files it has already read, and the hypothesis it is currently testing.
The main challenge with short-term memory is management. As tasks get longer, context fills up. Good agents handle this in one of three ways:
Sliding window: Older messages drop out of context as new ones arrive, keeping only the most recent N tokens. Simple to implement, but the agent can lose critical context from earlier in the task.
Summarization: The agent periodically compresses older parts of the conversation into a summary, which takes less space than the raw dialogue. This preserves the gist while freeing room for new content.
Selective retention: The agent identifies which parts of the history are load-bearing and keeps those, discarding the rest. This is harder to implement correctly but produces the best results.
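The first and third strategies can be sketched in a few lines. This is a minimal illustration, not any framework's implementation: `count_tokens` approximates tokens by counting words, and the `pinned` flag is a hypothetical stand-in for whatever signal marks a message as load-bearing. Summarization is left out because it needs a model call.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Message:
    role: str
    content: str
    pinned: bool = False  # hypothetical marker for load-bearing messages


def count_tokens(text: str) -> int:
    """Crude stand-in for a real tokenizer: one token per word."""
    return len(text.split())


def sliding_window(messages: List[Message], budget: int) -> List[Message]:
    """Keep the newest messages whose combined size fits the budget."""
    kept: List[Message] = []
    used = 0
    for message in reversed(messages):
        cost = count_tokens(message.content)
        if used + cost > budget:
            break  # everything older than this point is dropped
        kept.append(message)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


def selective_retention(messages: List[Message], budget: int) -> List[Message]:
    """Always keep pinned messages, then fill what is left with recent ones."""
    pinned_cost = sum(count_tokens(m.content) for m in messages if m.pinned)
    remaining = max(budget - pinned_cost, 0)
    recent = sliding_window([m for m in messages if not m.pinned], remaining)
    recent_ids = {id(m) for m in recent}
    return [m for m in messages if m.pinned or id(m) in recent_ids]


history = [
    Message("user", "The failing test is test_login_expired_token", pinned=True),
    Message("assistant", "Reading auth/session.py now"),
    Message("user", "Also check the token refresh path"),
    Message("assistant", "The refresh path skips the expiry check"),
]
```

With a budget of 13 word-tokens, `sliding_window(history, 13)` keeps only the last two messages and loses the name of the failing test, while `selective_retention(history, 13)` keeps the pinned first message plus the most recent reply.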
The Letta framework was built specifically around the problem of managing context under pressure. Its approach uses fixed memory blocks with defined sizes - one block for core facts about the user, one for the current task, one for recent conversation - and explicit read/write operations to update them. That structure forces the agent to be deliberate about what it keeps in working memory at any moment.
Long-term memory: persistence across sessions
Long-term memory is information that survives after a session ends and can be retrieved in future sessions. It is where agents store facts that should remain available indefinitely: user preferences, project context, past decisions, accumulated domain knowledge.
The implementation is almost always an external database. When a session ends (or during a session, at defined intervals), the agent writes important information to a store. At the start of a new session, the agent retrieves relevant facts from that store and injects them into context before beginning the task.
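The write-then-inject cycle can be sketched with SQLite from the Python standard library. The table layout, function names, and prompt wording are all assumptions made for illustration; real systems decide for themselves what counts as important enough to write.

```python
import sqlite3
from typing import List, Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the fact store. ':memory:' keeps the demo self-contained."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS facts ("
        " user_id TEXT NOT NULL,"
        " topic TEXT NOT NULL,"
        " value TEXT NOT NULL,"
        " PRIMARY KEY (user_id, topic))"
    )
    return conn


def remember(conn: sqlite3.Connection, user_id: str, topic: str, value: str) -> None:
    """Write a fact, replacing any earlier value for the same user and topic."""
    conn.execute(
        "INSERT OR REPLACE INTO facts (user_id, topic, value) VALUES (?, ?, ?)",
        (user_id, topic, value),
    )
    conn.commit()


def recall(conn: sqlite3.Connection, user_id: str) -> List[Tuple[str, str]]:
    """Return every stored (topic, value) pair for one user, sorted by topic."""
    cursor = conn.execute(
        "SELECT topic, value FROM facts WHERE user_id = ? ORDER BY topic",
        (user_id,),
    )
    return cursor.fetchall()


def build_context(conn: sqlite3.Connection, user_id: str) -> str:
    """Format stored facts as text to prepend to a new session's prompt."""
    facts = recall(conn, user_id)
    if not facts:
        return ""
    lines = [f"- {topic}: {value}" for topic, value in facts]
    return "Known facts about this user:\n" + "\n".join(lines)
```

At session end the agent would call `remember` for each fact worth keeping. At the start of the next session, `build_context` produces the text to inject. Loading every fact works only while the store is small, which is exactly why retrieval becomes the central question.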
The key question for any long-term memory system is retrieval: when you have thousands of stored facts, how do you decide which ones to load into the current context? This is where the architecture choices start to matter significantly.
Episodic vs semantic memory
These terms come from cognitive psychology, and they map cleanly onto agent memory design.
Episodic memory records specific past events: "On Tuesday, the user asked me to refactor the authentication module and we decided to use JWT." It is the log of what happened. Episodic memory is most useful when you need to understand prior decisions, avoid repeating mistakes, or pick up where a past session left off. Notion AI uses a form of episodic memory when it builds context from a user's document history within a workspace.
Semantic memory stores general knowledge and facts without tying them to specific events: "The user prefers TypeScript over JavaScript. Their stack is Next.js and Supabase. They use tabs not spaces." It is the distilled understanding of the world or of the user. Semantic memory is what gives an agent the ability to adapt its behavior based on accumulated knowledge, rather than just replaying past events.
In practice, most agent memory systems blend both types. Episodic memory helps with recall and continuity; semantic memory helps with personalization and judgment. The challenge is maintaining both without letting the store grow so large that retrieval becomes noisy.
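One way to picture the difference is in the data structures. In the sketch below, which uses hypothetical names throughout, episodic memory is an append-only log of dated events, while semantic memory is a mapping that keeps only the latest value for each topic.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class Episode:
    when: datetime
    summary: str


class AgentMemory:
    def __init__(self) -> None:
        self.episodes: List[Episode] = []  # episodic: what happened, in order
        self.facts: Dict[str, str] = {}    # semantic: what is currently believed

    def record_episode(self, when: datetime, summary: str) -> None:
        """Episodes are only ever appended; history is not rewritten."""
        self.episodes.append(Episode(when, summary))

    def learn_fact(self, topic: str, value: str) -> None:
        """Facts overwrite: only the most recent value per topic survives."""
        self.facts[topic] = value

    def episodes_since(self, cutoff: datetime) -> List[Episode]:
        """Recall events on or after a cutoff, e.g. to resume a past session."""
        return [e for e in self.episodes if e.when >= cutoff]
```

The asymmetry is the point of the sketch: asking "what did we decide on Tuesday?" is a query over `episodes`, while asking "which language does this user prefer?" is a lookup in `facts`.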
A third type worth mentioning is procedural memory: the agent's knowledge of how to do things, encoded in its tools, prompts, and system instructions. This is less often discussed as a memory type because it is usually baked into the agent's setup rather than dynamically updated. But some advanced architectures treat it as mutable - agents that can revise their own instructions based on experience are using a form of procedural memory.
Vector stores: the engine of semantic retrieval
The technical piece that makes long-term memory practically useful is vector search. The core idea is straightforward: any piece of text can be converted into a numerical vector (an embedding) that captures its semantic meaning. Similar texts produce similar vectors. If you store thousands of memory fragments as vectors, you can find the ones most relevant to a query by computing similarity in vector space - no exact keyword match required.
A system that stores and searches embeddings this way is called a vector store (or vector database), and it is the dominant architecture for agent long-term memory today. When an agent needs to recall relevant context, it embeds the current query and searches the vector store for the N most similar stored memories. Those results get injected into the prompt.
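The embed, store, and search cycle can be shown with a toy example. Loud caveat: the `embed` function below is a hashed bag-of-words, not a real embedding model, so it captures word overlap rather than meaning. It is there only so the sketch runs with the standard library. The `VectorStore` class is likewise a hypothetical in-memory stand-in for a real vector database.

```python
import math
import re
import zlib
from typing import List, Tuple

DIMENSIONS = 256


def embed(text: str) -> List[float]:
    """Toy embedding: hash each word into a bucket, then scale to unit length."""
    vector = [0.0] * DIMENSIONS
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        vector[zlib.crc32(word.encode("utf-8")) % DIMENSIONS] += 1.0
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else vector


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity; a plain dot product because inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))


class VectorStore:
    def __init__(self) -> None:
        self._items: List[Tuple[str, List[float]]] = []

    def add(self, text: str) -> None:
        """Embed once at write time and keep the vector next to the text."""
        self._items.append((text, embed(text)))

    def search(self, query: str, top_n: int = 3) -> List[Tuple[str, float]]:
        """Return the top_n stored texts most similar to the query."""
        query_vector = embed(query)
        scored = [(cosine(query_vector, vec), text) for text, vec in self._items]
        scored.sort(key=lambda pair: pair[0], reverse=True)
        return [(text, score) for score, text in scored[:top_n]]


store = VectorStore()
store.add("User prefers TypeScript over JavaScript")
store.add("Project stack is Next.js and Supabase")
store.add("Authentication module uses JWT")
results = store.search("authentication module JWT", top_n=2)
```

This brute-force search compares the query against every stored vector. Dedicated vector databases typically use approximate nearest-neighbor indexes to avoid that linear scan at scale.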
Popular vector databases include Chroma, Pinecone, Weaviate, and Qdrant. Each offers a slightly different tradeoff between simplicity, scalability, and query features. For small agents with modest memory needs, Chroma running locally is often enough. For production systems handling millions of memories across thousands of users, a managed service like Pinecone becomes the practical choice.
The quality of retrieval depends heavily on the embedding model used. A weak embedding model produces vectors that fail to capture nuance, and retrieval becomes unreliable. Most production systems use dedicated embedding models (OpenAI's text-embedding-3 family, Cohere Embed, or open-source alternatives like nomic-embed-text) that are separate from the main generation model.
The MCP memory approach
The Model Context Protocol (MCP) offers a standardized way to connect agents to external memory stores. Rather than baking memory retrieval directly into an agent's code, an MCP memory server exposes memory operations as tools: store a fact, query for relevant memories, update an existing entry, delete stale information.
This separation has practical advantages. The memory system becomes independent of the agent framework. You can swap out the underlying store, change the retrieval strategy, or share a memory server across multiple agents without rewriting the agents themselves. It also means agents can use their existing tool-calling infrastructure to manage memory, rather than requiring a separate memory API.
The tradeoff is latency. Every memory retrieval becomes a tool call, which adds a round trip compared to in-memory retrieval. For most use cases this is acceptable. For high-frequency agents that query memory on every loop iteration, it can become a bottleneck worth designing around.
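The shape of "memory operations as tools" can be sketched in-process. The example below deliberately does not use any MCP SDK or the MCP wire format: the class, the tool names, and the result dictionaries are all hypothetical. It only shows how the four operations sit behind a single tool-calling entry point.

```python
from typing import Any, Callable, Dict, List


class MemoryToolServer:
    """Toy in-process stand-in for a memory server that exposes tools."""

    def __init__(self) -> None:
        self._entries: Dict[int, str] = {}
        self._next_id = 1
        # Tool names are hypothetical; a real server defines its own.
        self._tools: Dict[str, Callable[..., Dict[str, Any]]] = {
            "store_memory": self._store,
            "query_memory": self._query,
            "update_memory": self._update,
            "delete_memory": self._delete,
        }

    def list_tools(self) -> List[str]:
        return sorted(self._tools)

    def call_tool(self, name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
        """Single entry point, so the agent reuses its normal tool-calling path."""
        handler = self._tools.get(name)
        if handler is None:
            return {"ok": False, "error": f"unknown tool: {name}"}
        return handler(**arguments)

    def _store(self, text: str) -> Dict[str, Any]:
        entry_id = self._next_id
        self._entries[entry_id] = text
        self._next_id += 1
        return {"ok": True, "id": entry_id}

    def _query(self, keyword: str) -> Dict[str, Any]:
        # Substring match keeps the sketch short; a real store would rank results.
        needle = keyword.lower()
        matches = [
            {"id": entry_id, "text": text}
            for entry_id, text in self._entries.items()
            if needle in text.lower()
        ]
        return {"ok": True, "matches": matches}

    def _update(self, entry_id: int, text: str) -> Dict[str, Any]:
        if entry_id not in self._entries:
            return {"ok": False, "error": f"no entry {entry_id}"}
        self._entries[entry_id] = text
        return {"ok": True, "id": entry_id}

    def _delete(self, entry_id: int) -> Dict[str, Any]:
        removed = self._entries.pop(entry_id, None) is not None
        return {"ok": removed}
```

Because the agent only ever sees `list_tools` and `call_tool`, the dictionary behind them could be swapped for a vector database without touching agent code. In a real deployment each `call_tool` would also cross a process or network boundary, which is where the extra round trip comes from.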
MemGPT and the virtual context approach
MemGPT (now part of the Letta project) proposed a different framing of the memory problem that is worth understanding. Instead of treating memory as an external store that gets queried before a prompt is built, MemGPT treats memory management as something the agent does actively, inside the loop, using explicit function calls.
The agent has a fixed main context with defined sections: a persona block, a human block, a conversation scratchpad. When those sections fill up, the agent calls memory functions to move content in and out - archiving older conversation to a searchable external store, retrieving relevant past exchanges when needed. The agent knows about its own memory limits and manages them explicitly.
This is closer to how operating systems manage virtual memory than how most agents handle context. The model becomes aware of its own memory architecture and learns to use it deliberately, rather than just receiving a pre-populated context it cannot see or control.
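A rough sketch of the idea, with the caveat that it is not MemGPT's or Letta's actual API: the class and method names are invented for illustration, block limits are counted in characters, and archival search is a plain substring match. In a real system the model would invoke operations like these through function calls.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MemoryBlock:
    label: str
    limit: int  # maximum size in characters
    value: str = ""

    def append(self, text: str) -> None:
        """Add a line to the block, refusing writes that would overflow it."""
        candidate = f"{self.value}\n{text}".strip()
        if len(candidate) > self.limit:
            raise ValueError(f"block '{self.label}' would exceed {self.limit} characters")
        self.value = candidate


@dataclass
class VirtualContext:
    persona: MemoryBlock
    human: MemoryBlock
    max_messages: int
    messages: List[str] = field(default_factory=list)  # in-context scratchpad
    archive: List[str] = field(default_factory=list)   # searchable external store

    def add_message(self, text: str) -> None:
        """Append to the scratchpad, paging the oldest messages out when full."""
        self.messages.append(text)
        while len(self.messages) > self.max_messages:
            self.archive.append(self.messages.pop(0))

    def archival_search(self, query: str) -> List[str]:
        """Page relevant history back in on demand."""
        needle = query.lower()
        return [m for m in self.archive if needle in m.lower()]
```

The overflow error matters as much as the eviction. When a block is full, the agent has to decide what to rewrite or archive, and that explicit decision is what keeps the main context bounded.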
Forgetting and memory hygiene
A memory system that only writes and never deletes eventually becomes a liability. Stale facts, outdated preferences, and incorrect beliefs pile up. Retrieval starts returning memories that are no longer true, and the agent's behavior degrades as it acts on outdated information.
Good memory architectures include some form of expiry or pruning. Common approaches:
Time decay: Memories get a confidence score that decreases over time. Old facts are deprioritized in retrieval and eventually removed.
Explicit invalidation: When new information contradicts a stored fact, the old entry gets marked as superseded. "User's stack is now Remix, not Next.js" invalidates the previous framework preference.
Periodic review: The agent (or a background process) scans the memory store and removes entries that appear stale, redundant, or low-value.
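The three approaches combine naturally. The sketch below is one possible arrangement under stated assumptions: confidence halves every 30 days, time is measured as a plain day counter, and a new write to a topic supersedes the previous entry. The class name, half-life, and threshold are illustrative, not taken from any framework.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class MemoryEntry:
    topic: str
    value: str
    created_day: float
    superseded: bool = False


def confidence(entry: MemoryEntry, today: float, half_life_days: float = 30.0) -> float:
    """Time decay: confidence halves every `half_life_days`."""
    age = max(today - entry.created_day, 0.0)
    return 0.5 ** (age / half_life_days)


class HygienicStore:
    def __init__(self) -> None:
        self.entries: List[MemoryEntry] = []

    def write(self, topic: str, value: str, today: float) -> None:
        """Explicit invalidation: a new fact supersedes older ones on its topic."""
        for entry in self.entries:
            if entry.topic == topic and not entry.superseded:
                entry.superseded = True
        self.entries.append(MemoryEntry(topic, value, today))

    def active(self, today: float, min_confidence: float = 0.1) -> List[MemoryEntry]:
        """Entries eligible for retrieval: not superseded and not too decayed."""
        return [
            e for e in self.entries
            if not e.superseded and confidence(e, today) >= min_confidence
        ]

    def prune(self, today: float, min_confidence: float = 0.1) -> int:
        """Periodic review: delete everything inactive, return how many went."""
        before = len(self.entries)
        self.entries = self.active(today, min_confidence)
        return before - len(self.entries)
```

Separating `active` from `prune` lets retrieval ignore stale entries immediately, while the actual deletion runs on whatever schedule the background process uses.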
This is an area where current implementations are weaker than they could be. Most agent frameworks focus heavily on writing to memory and retrieving from it, and treat deletion as an afterthought. As agents accumulate more history, memory hygiene will become a more significant engineering concern.
Choosing a memory architecture in practice
The right memory architecture depends on the agent's use case and expected lifespan.
For task-scoped agents that complete a job in one session and have no need to remember anything afterward, context window management is the only memory problem worth solving. Keep the context clean, summarize when needed, and do not add complexity you do not need.
For agents that work with the same user across multiple sessions - personal assistants, coding agents with persistent project context, research agents that build knowledge over time - you need a long-term store. A vector database for semantic search plus a structured store (Postgres, SQLite) for explicit facts is a common and reliable combination.
For agents that need to handle very long tasks within a single session - spanning hours, with thousands of tool calls - you need the kind of in-context memory management that MemGPT (now Letta) pioneered. Fixed-size memory blocks with explicit management are more reliable than hoping the context window holds everything.
Understanding how AI agents work at the loop level is the prerequisite for reasoning about where memory fits in the architecture. Memory is not a separate system sitting alongside the agent - it is a core part of how the perception-action loop maintains continuity across time.
What good agent memory looks like in practice
The most reliable signal that an agent has well-designed memory is subtler than most people expect. It is not that the agent dramatically recites things you told it three months ago. It is that the agent does not ask you the same question twice. It does not need you to re-explain your preferences each session. It does not treat each task as if it has never worked with you before.
Good memory is invisible in the same way good infrastructure is invisible. You only notice it when it breaks. An agent that asks "what programming language do you prefer?" for the fifth time in two weeks does not have a language problem or a reasoning problem. It has a memory problem.
The frameworks and protocols discussed here - vector stores, MCP memory servers, MemGPT's virtual context approach, Letta's managed memory blocks - are all working toward the same goal: agents that accumulate useful knowledge without accumulating noise, and that surface the right context at the right moment without drowning the prompt in everything they have ever been told.
That is a hard engineering problem. The field is still working on it. But the foundations are in place, and understanding them is the first step to building agents that actually get better at helping you over time.