When LLMs Learn to Remember — Part 1: LLMs Don't Remember Anything

LLMs are completely stateless - they forget everything after each call. So how does ChatGPT remember your preferences? This post breaks down the illusion of AI memory, compares current memory strategies (markdown trees, vector databases, hybrid systems), and explains context tree indexing.


Every week, someone asks me: "How does ChatGPT remember what I told it last month?"

The answer might surprise you: it doesn't.

Large Language Models are completely, fundamentally stateless. Every API call starts from absolute zero. There is no hidden database, no persistent memory, no neural trace of your previous conversations. An LLM is a pure mathematical function: text goes in, text comes out, everything is forgotten.

So why does it feel like your AI assistant knows you? Let's pull back the curtain.


The Illusion of Memory

When you open ChatGPT and it greets you by name or references a preference you shared weeks ago, here's what actually happens behind the scenes:

┌─────────────────────────────────────────────────┐
│              What you see:                       │
│  You: "Summarize my project status"             │
│  AI:  "Your project-alpha deploy is pending..." │
│                                                  │
│              What actually happens:              │
│  System prompt (your preferences, rules)        │
│  + Memory snippets (extracted from past chats)  │
│  + Conversation history (current session)       │
│  + Your message                                 │
│  ─────────────────────────────────────────────  │
│  = ONE massive prompt sent to stateless LLM     │
└─────────────────────────────────────────────────┘

Before every single response, the application layer silently prepends context to your message:

  1. System prompt - Instructions and your saved preferences
  2. Memory snippets - Facts previously extracted and stored in a database
  3. Conversation history - All messages in the current chat session

The model reads all of this fresh every time, as if seeing it for the first time. The "memory" lives entirely outside the model, in the application wrapping it.
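To make the prepending step concrete, here is a minimal sketch of how an application layer might assemble that one massive prompt. All names (`build_prompt`, the example memories and history) are illustrative, not any vendor's actual implementation:

```python
# Hypothetical sketch: how an application layer might assemble the
# "one massive prompt" before every call to a stateless LLM.

def build_prompt(system_prompt, memories, history, user_message):
    """Concatenate every layer of context into a single message list."""
    messages = [{"role": "system",
                 "content": system_prompt + "\n\nKnown facts:\n" +
                            "\n".join(f"- {m}" for m in memories)}]
    messages.extend(history)                      # current session so far
    messages.append({"role": "user", "content": user_message})
    return messages

prompt = build_prompt(
    system_prompt="You are a helpful assistant.",
    memories=["User is a software engineer", "User prefers Docker deployments"],
    history=[{"role": "user", "content": "Hi"},
             {"role": "assistant", "content": "Hello!"}],
    user_message="Summarize my project status",
)
```

Every call rebuilds this list from scratch; nothing persists inside the model between calls.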

How ChatGPT Does It

OpenAI's memory system works like this:

  • After conversations, a background process extracts key facts ("user is a software engineer in Switzerland", "user prefers Docker deployments")
  • These are stored in a database tied to your account
  • On each new conversation, relevant memories are injected into the system prompt
  • The model appears to "know" you, but it's being told about you every time

How Claude Does It

Anthropic's approach with Claude is similar, with a few variations:

  • Claude on the web stores "memories" as extracted facts from conversations
  • Claude Code (the CLI tool) uses a file-based approach: a MEMORY.md index file points to individual markdown memory files organized by topic
  • These files are loaded into the conversation context at startup

The key takeaway: every AI assistant uses the same core pattern. The differences are in storage format, retrieval strategy, and who decides what's worth remembering.


The Core Pattern

Every AI memory system, from a weekend project to a production system serving millions, follows this architecture:

┌───────────┐    ┌─────────────┐    ┌──────────────┐    ┌──────────┐
│  Store    │ →  │  Retrieve   │ →  │  Inject into │ →  │  LLM     │
│  Knowledge│    │  Relevant   │    │  Prompt      │    │  Responds│
│           │    │  Subset     │    │              │    │          │
└───────────┘    └─────────────┘    └──────────────┘    └──────────┘

The only design decisions are:

| Decision | Options |
|---|---|
| What to store | Raw conversations, curated summaries, structured facts |
| Where to store it | Files, SQLite, PostgreSQL, vector DB, graph DB |
| How to retrieve it | Read everything, keyword search, semantic similarity, tree traversal |
| Who curates it | The user manually, the same LLM, or a separate pipeline |

Let's look at each strategy in detail.


Strategy 1: Conversation History (The Baseline)

The simplest approach: just keep the entire conversation and send it all back every time.

[System Prompt]
[Message 1: User said X]
[Message 2: AI said Y]
[Message 3: User said Z]
...
[Message N: User's new question]
         ↓
    Stateless LLM

How it works: Every message ever exchanged is stored and sent to the model each time.

The problem: Context windows have limits. GPT-4 Turbo handles ~128K tokens, Claude up to 200K. A single conversation might fit, but weeks of daily interactions won't. Once you hit the limit, you have to drop old messages (losing memory) or summarize them (losing detail).

Used by: Every chatbot for within-session memory. It's why the AI "forgets" when you start a new chat.
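The drop-old-messages behavior above can be sketched in a few lines. This is a toy: token counting is approximated by word count, where a real system would use the model's tokenizer:

```python
# Illustrative sketch of the baseline strategy: keep every message and,
# once the context window is full, drop the oldest ones.

def fit_to_window(history, max_tokens):
    """Drop oldest messages until the estimated token count fits."""
    def estimate(msgs):
        # crude stand-in for a real tokenizer
        return sum(len(m["content"].split()) for m in msgs)
    trimmed = list(history)
    while len(trimmed) > 1 and estimate(trimmed) > max_tokens:
        trimmed.pop(0)          # oldest message is lost -- "memory" gone
    return trimmed

history = [{"role": "user", "content": "word " * 100},
           {"role": "assistant", "content": "word " * 100},
           {"role": "user", "content": "new question"}]
recent = fit_to_window(history, max_tokens=120)
```

The first message silently disappears once the budget is exceeded, which is exactly the "forgetting" users experience in long sessions.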


Strategy 2: Extracted Facts Database

Instead of keeping raw conversations, extract structured facts and store them separately.

Conversation: "I'm a software engineer in Switzerland, 
               I always use Docker, and I prefer FastAPI over Flask"
                          ↓
              Extraction Pipeline
                          ↓
┌─────────────────────────────────────────┐
│  Facts Database                          │
│  - Role: Software Engineer               │
│  - Location: Switzerland                 │
│  - Preference: Docker for all backends   │
│  - Preference: FastAPI over Flask        │
└─────────────────────────────────────────┘

How it works: After each conversation, a process (often another LLM call) extracts key facts and stores them as structured entries. On the next conversation, relevant facts are retrieved and injected.

Pros: Compact, scales to many conversations, preserves key information indefinitely.

Cons: The extraction step can lose nuance. "I prefer FastAPI but we're stuck with Flask for the legacy project" might get stored as just "prefers FastAPI."

Used by: ChatGPT's memory feature, many production AI assistants.
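A runnable sketch of the extraction step follows. Production systems typically make a second LLM call to pull facts out of the transcript; here two hard-coded regex patterns stand in for that call so the pipeline shape is visible (and the "lost nuance" problem is easy to imagine: these patterns would happily miss the "stuck with Flask" caveat):

```python
# Toy extraction pipeline: transcript in, structured (category, value)
# facts out. The regexes are illustrative stand-ins for an LLM call.
import re

def extract_facts(transcript):
    """Return structured (category, value) facts found in the text."""
    facts = []
    m = re.search(r"I'm an? ([\w\s]+?) in ([\w\s]+?)[,.]", transcript)
    if m:
        facts.append(("role", m.group(1).strip()))
        facts.append(("location", m.group(2).strip()))
    for pref in re.findall(r"I (?:always use|prefer) ([\w\s]+?)(?:,|\.|$)",
                           transcript):
        facts.append(("preference", pref.strip()))
    return facts

facts = extract_facts("I'm a software engineer in Switzerland, "
                      "I always use Docker, and I prefer FastAPI over Flask.")
```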


Strategy 3: Vector Database (RAG-Based Memory)

Embed everything into vectors and retrieve by semantic similarity.

"What did I decide about the deployment strategy?"
                    ↓
            Embed query → [0.12, -0.45, 0.78, ...]
                    ↓
         Cosine similarity search
                    ↓
┌─────────────────────────────────────────────┐
│  Vector DB (pgvector / Chroma / Qdrant)      │
│                                               │
│  [0.11, -0.44, 0.79, ...] → "March 5: decided│
│   to use blue-green deployment with Docker"   │
│  [0.08, -0.41, 0.75, ...] → "Feb 20: tested  │
│   rolling updates but too much downtime"      │
│  [...] → thousands more entries               │
└─────────────────────────────────────────────┘

How it works: Every piece of knowledge gets converted into a high-dimensional vector (embedding). When you ask a question, your query is also embedded, and the system finds the most similar stored vectors.

Pros: Scales to massive knowledge bases, handles fuzzy queries ("something about that meeting where we discussed pricing"), no exact keyword match needed.

Cons: Embeddings are lossy -- they capture similarity, not meaning. Two decisions that sound similar but have opposite conclusions might both get retrieved, confusing the model. Also requires infrastructure (embedding model + vector DB).

Used by: Enterprise AI systems, RAG applications, coding assistants with large codebases.
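The retrieval math itself is simple. In the sketch below, tiny hand-made vectors stand in for real embeddings (a real system would call an embedding model), so only the cosine-similarity ranking is shown:

```python
# Minimal sketch of semantic retrieval over a vector store.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

store = [
    ([0.11, -0.44, 0.79], "March 5: decided on blue-green deployment with Docker"),
    ([0.08, -0.41, 0.75], "Feb 20: tested rolling updates, too much downtime"),
    ([0.90,  0.10, 0.05], "Grocery list: milk, eggs"),
]

def retrieve(query_vec, store, k=2):
    """Return the k stored texts most similar to the query embedding."""
    ranked = sorted(store, key=lambda e: cosine(query_vec, e[0]), reverse=True)
    return [text for _, text in ranked[:k]]

top = retrieve([0.12, -0.45, 0.78], store)  # query: "deployment strategy?"
```

Note that both deployment entries rank close together despite reaching opposite conclusions; that is the lossiness problem in miniature.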


Strategy 4: Markdown Files (Structured Local Memory)

Store knowledge as organized plain-text files that the LLM reads directly.

memory/
├── MEMORY.md              (index - what exists and where)
├── user_preferences.md    (how the user works)
├── project_alpha.md       (project-specific context)
├── decisions_march.md     (key decisions with reasoning)
└── feedback_testing.md    (user corrections and guidance)

How it works: Knowledge is saved as markdown files, organized by topic. An index file maps what's stored where. At conversation start, the system reads the index and loads relevant files into context.

Pros: Human-readable and editable, version-controllable with Git, zero infrastructure, the user can inspect and correct what the AI "knows."

Cons: Doesn't scale well past ~100 files without good organization. Retrieval is basic (file selection, not semantic search).

Used by: Claude Code, ByteRover, Cursor, and other developer-focused AI tools.
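The index-then-load pattern can be sketched as follows. The file names and index format here are illustrative, loosely modeled on the MEMORY.md layout above:

```python
# Sketch of file-based memory: read the index, load only relevant files.
import pathlib
import tempfile

# Set up a throwaway memory folder for the demo.
root = pathlib.Path(tempfile.mkdtemp()) / "memory"
root.mkdir()
(root / "MEMORY.md").write_text(
    "- user_preferences.md: how the user works\n"
    "- project_alpha.md: project-specific context\n")
(root / "user_preferences.md").write_text("Always use Docker. Prefer FastAPI.\n")
(root / "project_alpha.md").write_text("Deploy is pending review.\n")

def load_memory(root, wanted_topics):
    """Read the index, then load only the files relevant to this session."""
    index = (root / "MEMORY.md").read_text()
    context = [f"Memory index:\n{index}"]
    for line in index.splitlines():
        name = line.split(":")[0].lstrip("- ").strip()
        if any(topic in name for topic in wanted_topics):
            context.append((root / name).read_text())
    return "\n".join(context)

context = load_memory(root, wanted_topics=["preferences"])
```

Because everything is plain text, you can open any of these files in an editor and correct what the AI "knows" directly.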

Who Curates the Memory?

This is where it gets interesting. In this approach, the same LLM that reasons about your questions also decides what's worth remembering. There's no separate embedding pipeline or extraction service that might lose meaning along the way.

When the LLM notices you made an important decision or stated a preference, it writes a markdown file. When information becomes outdated, it updates or removes the file. The curator and the consumer of memory are the same entity.

This is the key insight that ByteRover and similar tools formalize: a separate pipeline introduces a lossy translation layer. The model that understands your intent is the best judge of what's worth preserving.


Strategy 5: The Context Tree (Not a B-Tree)

This is the approach that deserves deeper explanation, because it's often misunderstood.

What a B-Tree Is

A B-tree is a database index structure. It's a balanced, sorted tree where each node contains multiple keys and pointers. When you search for a value, you traverse from root to leaf in O(log n) time. It's designed for exact match and range queries on structured data.

           [50]                    ← Root
          /    \
      [20,30]  [60,80]            ← Internal nodes
      / | \     / | \
   [10][25][35][55][70][90]       ← Leaf nodes (data)

B-trees answer: "Find record with ID=35" or "Find all records between 20 and 50."

What a Context Tree Is

A context tree is fundamentally different. It's a semantic hierarchy where each level represents increasing specificity, and navigation is done by an LLM making relevance judgments, not by comparing sort keys.

memory/
├── index.md                     ← "What topics exist?"
│
├── work/
│   ├── index.md                 ← "What projects am I tracking?"
│   ├── project-alpha/
│   │   ├── overview.md          ← Project goals, stack, status
│   │   ├── decisions.md         ← Key architectural decisions
│   │   └── deployment.md        ← Deploy process, server details
│   └── blog/
│       ├── overview.md
│       └── analytics.md
│
├── preferences/
│   ├── coding.md                ← "Always use Docker, prefer FastAPI"
│   └── communication.md         ← "Be concise, no emojis"
│
└── decisions/
    ├── 2026-03.md               ← Decisions made in March
    └── 2026-04.md               ← Decisions made in April

How Retrieval Works

Here's the crucial difference from vector search or B-tree lookup:

Query: "What's the deploy process for project alpha?"

Step 1: LLM reads index.md
        → Decides: this is about "work" (not preferences, not decisions)

Step 2: LLM reads work/index.md
        → Decides: this is about "project-alpha" (not blog)

Step 3: LLM reads project-alpha/deployment.md
        → Found the answer. Stop.

Files read: 3 out of potentially hundreds
Tokens used: ~500 instead of ~15,000 (loading everything)
LLM calls for navigation: 0-1 (can often be done in a single pass)

Comparison Table

| Property | B-Tree | Context Tree | Vector Search |
|---|---|---|---|
| Navigation | Compare sort keys | LLM judges relevance | Cosine similarity |
| Query type | Exact match, range | Natural language | Semantic similarity |
| Structure | Balanced, self-adjusting | Human-organized hierarchy | Flat (all vectors equal) |
| Scales to | Billions of records | Hundreds of documents | Millions of embeddings |
| Latency | Microseconds | Milliseconds | Milliseconds |
| Infrastructure | Database engine | File system | Embedding model + vector DB |
| Human-readable | No | Yes | No |

Why It Works

The context tree achieves something clever: it uses the file system as an index and the LLM as a query engine. Instead of embedding documents and losing semantic nuance, it preserves full context in readable files and lets the reasoning model decide what's relevant.

The trade-off is clear: it doesn't scale to millions of documents (an LLM can't traverse a tree with thousands of branches efficiently), but for personal knowledge bases with dozens to hundreds of files, it's remarkably effective with minimal token usage.


Strategy 6: Hybrid Approaches

In practice, the most robust systems combine strategies:

┌─────────────────────────────────────────────┐
│             Hybrid Memory System             │
│                                              │
│  Layer 1: Core Context (always loaded)       │
│  ├── User preferences (markdown)             │
│  ├── Active project summaries (markdown)     │
│  └── Recent decisions (markdown)             │
│                                              │
│  Layer 2: Searchable Archive (on demand)     │
│  ├── Past conversations (vector DB)          │
│  ├── Meeting notes (vector DB)               │
│  └── Historical decisions (vector DB)        │
│                                              │
│  Layer 3: Structured Data (queried)          │
│  ├── Calendar events (SQLite/API)            │
│  ├── Task status (SQLite/API)               │
│  └── Contact information (SQLite)            │
└─────────────────────────────────────────────┘

Layer 1 is always injected (small, high-value, curated markdown). Layer 2 is searched only when needed (large, historical, vector-indexed). Layer 3 is queried for specific structured data.

This mirrors how human memory works: you always know your name and current project (Layer 1), you can recall relevant past experiences when prompted (Layer 2), and you look up specific facts in your calendar or contacts (Layer 3).
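The three-layer policy can be sketched as a single assembly function. Layer contents are stubbed with lambdas here; the point is the policy itself: an always-loaded core, an archive searched only when relevant, and structured lookups for exact facts:

```python
# Sketch of hybrid context assembly across the three layers described above.
# The search and lookup callbacks are illustrative stubs.

def assemble_context(query, core, archive_search, structured_lookup):
    """Merge always-on core context with on-demand retrieval results."""
    parts = ["## Core context (always loaded)"] + core
    hits = archive_search(query)            # Layer 2: only when relevant
    if hits:
        parts += ["## Retrieved from archive"] + hits
    facts = structured_lookup(query)        # Layer 3: exact structured data
    if facts:
        parts += ["## Structured data"] + facts
    return "\n".join(parts)

context = assemble_context(
    "When is the project-alpha review meeting?",
    core=["User prefers Docker", "Active project: project-alpha"],
    archive_search=lambda q: ["Note: review meetings moved to Thursdays"]
                             if "meeting" in q else [],
    structured_lookup=lambda q: ["Calendar: project-alpha review, Thu 10:00"]
                                if "project-alpha" in q else [],
)
```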


Choosing the Right Strategy

| Use Case | Recommended Strategy | Why |
|---|---|---|
| Personal AI assistant (daily use) | Markdown files | Simple, inspectable, no infra needed |
| Team AI assistant | Markdown + cloud sync | Shared context across team members |
| AI over large document corpus | Vector DB (RAG) | Semantic search over thousands of docs |
| Enterprise knowledge management | Hybrid (all layers) | Different data types need different access patterns |
| Coding assistant | Context tree + codebase indexing | Navigate large codebases efficiently |
| AI that tracks decisions over years | SQLite + markdown summaries | Structured queries + readable summaries |

The Practical Takeaway

Here's what I want you to remember from this post:

  1. LLMs are stateless. Always. Every "memory" is an application-level feature that re-injects stored context into each prompt.

  2. You don't need a vector database for personal AI memory. A well-organized folder of markdown files handles 90% of use cases with 10% of the complexity.

  3. The context tree is not a B-tree. It's a semantic hierarchy navigated by LLM judgment, not sort-key comparison. It trades scale for simplicity and human readability.

  4. The same LLM should curate its own memory. Separate extraction pipelines introduce lossy translation. The model that understands your intent is the best judge of what to save.

  5. Start simple, add complexity only when needed. Markdown files → add SQLite when you need date queries → add vectors when you have thousands of documents.

The best AI memory system is the one you can actually inspect, correct, and trust. For most people, that's a folder of well-organized text files -- not a distributed vector database.


Have questions about building AI memory systems? Find me on LinkedIn or check out the blog for more deep dives into practical AI engineering.
