When LLMs Learn to Remember — Part 2: Building a Memory System for Your AI Assistant
A practical guide to building persistent AI memory: Memory CRUD operations, post-conversation sweeps, context tree curation, prompt templates, and the unsolved problems nobody talks about.
In the previous post, we established that LLMs are completely stateless and that all "memory" is an application-level illusion. We compared six strategies for making AI assistants remember things.
Now the real question: how do you actually build and maintain a memory system?
This post covers the engineering side — who creates memories, when they get updated, how to prevent rot, and the unsolved problems the industry is still figuring out.
The Memory Lifecycle Problem
Storing a fact is easy. Keeping a knowledge base accurate over months is hard.
Think about it: you tell your AI assistant "I'm using Flask for this project" in January. In March, you migrate to FastAPI. If nobody updates that memory, your assistant keeps giving you Flask-based advice. Stale memory is worse than no memory.
This is the same problem every database faces. The difference is that AI memories are unstructured, semantic, and context-dependent — there's no schema to enforce consistency, no foreign keys to cascade deletes.
So we need to think about memory management the same way we think about data management: with explicit operations, rules, and lifecycle policies.
Memory CRUD: Operations for AI Knowledge
Just like databases have INSERT, UPDATE, DELETE, and SELECT, AI memory systems need a formal set of operations. Here's what's emerging as the standard:
The Core Operations
| Operation | When to Use | Example |
|---|---|---|
| CREATE | New fact, decision, or preference discovered | "User chose Kubernetes over Docker Swarm for production" |
| UPDATE | Existing fact has changed or evolved | "Project status: planning → development → deployed" |
| MERGE | Two memories overlap or are redundant | Combine "uses Docker" + "prefers Docker Compose" + "runs all backends in containers" into one preference |
| DELETE | Fact is no longer true or relevant | "Project X was cancelled" → remove all Project X memories |
| PROMOTE | Temporary observation confirmed as permanent | User corrected the same behavior 3 times → upgrade to core preference |
| DECAY | Memory hasn't been accessed or referenced in a long time | Auto-archive after N days with no reads |
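For concreteness, this operation vocabulary can be modeled as a small typed structure. The names below are my own sketch, not a standard library; they just give the table a concrete shape that later examples in this post could emit:

```python
from dataclasses import dataclass
from enum import Enum

class MemoryOp(Enum):
    CREATE = "CREATE"
    UPDATE = "UPDATE"
    MERGE = "MERGE"
    DELETE = "DELETE"
    PROMOTE = "PROMOTE"
    DECAY = "DECAY"

@dataclass
class MemoryOperation:
    op: MemoryOp
    file: str         # target memory file, e.g. "user_preferences.md"
    reason: str = ""  # why the operation is needed
    content: str = "" # new or updated body text (CREATE, UPDATE, MERGE)

# Example: a behavior corrected three times gets promoted to a core preference
promote = MemoryOperation(
    op=MemoryOp.PROMOTE,
    file="feedback_use_uv.md",
    reason="User corrected the same behavior 3 times",
)
```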
CREATE: When to Store New Knowledge
Not everything is worth remembering. Good memory creation follows a filter:
New information arrives
↓
Is it already known? → SKIP (check existing memories)
↓
Is it in the code/git? → SKIP (derivable from source)
↓
Is it ephemeral? → SKIP (only useful right now)
↓
Will it help in future conversations? → CREATE
What passes this filter:
- Decisions and their reasoning ("chose Postgres over MongoDB because we need ACID transactions")
- User preferences and corrections ("don't add type annotations to code I didn't change")
- Project context not in code ("deadline is March 15, driven by client contract")
- External references ("bugs tracked in Linear project INGEST")
What doesn't:
- Code patterns (read the code)
- Git history (run git log)
- Current task progress (use a task tracker)
- Things already documented in project files
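The filter above can be approximated as a pre-check before anything is written. This is a minimal sketch: the `existing_memories`, `derivable_from_source`, and `ephemeral` inputs are hypothetical hooks you would wire to your own system, and the duplicate check here is a naive substring match where a real system would use semantic similarity:

```python
def should_create_memory(
    fact: str,
    existing_memories: list[str],
    derivable_from_source: bool,
    ephemeral: bool,
) -> bool:
    """Apply the CREATE filter: skip known, derivable, or ephemeral facts."""
    # Is it already known? Naive substring check; a real system would
    # compare embeddings instead.
    if any(fact.lower() in m.lower() for m in existing_memories):
        return False
    # Is it in the code/git? Derivable facts don't belong in memory.
    if derivable_from_source:
        return False
    # Is it only useful right now?
    if ephemeral:
        return False
    # Otherwise, assume it may help future conversations.
    return True

should_create_memory(
    "chose Postgres over MongoDB for ACID transactions",
    existing_memories=["uses Docker Compose"],
    derivable_from_source=False,
    ephemeral=False,
)  # → True
```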
UPDATE: The Hardest Operation
Updates are where most memory systems fail. The challenge:
- Detecting that an update is needed — the AI must recognize that new information contradicts or evolves an existing memory
- Finding the right memory to update — semantic matching, not exact string matching
- Preserving history — sometimes knowing what was true matters ("we used to use Flask, migrated to FastAPI in March because...")
A naive implementation just appends new memories and never touches old ones. This leads to contradictions — the system holds both "uses Flask" and "uses FastAPI" and picks whichever it retrieves first.
Best practice: Every memory should have a timestamp. When updating, either overwrite with a note about the change, or version the memory:
---
name: Backend Framework Choice
updated: 2026-03-15
---
Currently using FastAPI for all new projects.
**History:**
- 2026-01: Started with Flask
- 2026-03: Migrated to FastAPI (better async support, auto-generated docs)
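A minimal version of this "overwrite plus history" pattern, assuming memory files use the frontmatter layout shown above (the parsing is deliberately simple and would need hardening for real files):

```python
from datetime import date

def version_memory(existing: str, new_fact: str, reason: str) -> str:
    """Overwrite the current fact, bump `updated:`, and append to History."""
    today = date.today().isoformat()
    lines = existing.splitlines()
    end = lines.index("---", 1)  # closing frontmatter delimiter
    # Refresh the updated: field, preserving other frontmatter keys.
    front = [l for l in lines[1:end] if not l.startswith("updated:")]
    front.append(f"updated: {today}")
    body = [l for l in lines[end + 1:] if l.strip()]
    # Keep existing history bullets, then log this change.
    history = [l for l in body if l.startswith("- ")]
    history.append(f"- {today}: {reason}")
    return "\n".join(
        ["---", *front, "---", new_fact, "", "**History:**", *history]
    )
```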
DELETE: When to Forget
Deletion is uncomfortable — what if we need that information later? But memory bloat is a real problem. Every extra memory file costs tokens when loaded into context.
Clear deletion triggers:
- Project completed or cancelled
- Decision reversed with no historical value
- Information moved to a permanent document (wiki, README)
- Factually wrong (corrected by user)
MERGE: Consolidation
Over time, related memories accumulate:
- "User prefers Docker" (from January)
- "User runs all backends in Docker Compose" (from February)
- "User doesn't want local installs, everything containerized" (from March)
These should become one memory:
---
name: Containerization Preference
type: user
---
All backends run in Docker Compose. No local installs.
Uses Docker for development and production environments.
DECAY: Automatic Aging
Not all memories are equal. Some are permanent ("user's name"), some are temporary ("currently debugging auth middleware"). A decay system handles this:
Memory created
↓
Accessed within 30 days? → Reset decay timer
↓
Not accessed in 30 days? → Mark as stale
↓
Not accessed in 90 days? → Archive or delete
This prevents memory rot without manual cleanup. The threshold depends on memory type — user preferences should never decay, project status should decay fast.
Who Creates the Memory?
This is the critical design decision. There are three approaches:
Approach 1: Real-Time Self-Curation
The same LLM that's chatting with you decides what to save, during the conversation.
User: "From now on, always use uv instead of pip"
↓
LLM thinks: "This is a preference correction, I should save it"
↓
LLM writes: feedback_use_uv.md
↓
Conversation continues
Pros:
- Immediate — preferences take effect in the same conversation
- Handles explicit "remember this" requests naturally
- No extra infrastructure
Cons:
- The LLM is focused on answering your question, not curating memory. It often forgets to save things.
- Important implicit knowledge gets missed ("user always approves refactors in a single PR" — the LLM doesn't realize this is a pattern until told)
- Interrupts the conversation flow with file writes
Approach 2: Post-Conversation Sweep
After the conversation ends, a separate process reviews the full transcript and extracts knowledge.
Conversation ends
↓
Sweep process reads full transcript
↓
Compares against existing memory index
↓
Outputs: CREATE / UPDATE / DELETE operations
↓
Applies changes to memory files
Pros:
- Has full conversation context (hindsight is clearer than real-time)
- Doesn't interrupt the conversation
- Can detect implicit patterns across the conversation
- Can use a cheaper model (the sweep is a simpler task than the original conversation)
Cons:
- Delayed — new preferences don't take effect until next conversation
- Requires infrastructure (hook to trigger sweep after conversation)
- Extra cost per conversation (~$0.01-0.02 with a fast model)
Approach 3: Hybrid (Recommended)
Both approaches together:
- Real-time handles explicit requests ("remember this", "from now on do X")
- Post-conversation sweep catches implicit knowledge, detects patterns, and runs maintenance (merge, decay, delete)
This is the pattern emerging in production systems. The real-time path handles urgency; the sweep handles quality.
Building a Post-Conversation Sweep
Here's a practical implementation you can build today.
The Sweep Prompt
The sweep LLM needs three things: existing memories, the conversation transcript, and clear instructions.
You are a memory curator for an AI assistant. Your job is to keep
the assistant's memory accurate and useful.
## EXISTING MEMORY INDEX
{contents of MEMORY.md}
## EXISTING MEMORY FILES
{contents of each referenced memory file}
## CONVERSATION TRANSCRIPT
{full conversation that just ended}
## YOUR TASK
Review the conversation and output a JSON array of operations:
[
  {
    "operation": "CREATE",
    "file": "project_new_api.md",
    "type": "project",
    "name": "New API Project",
    "description": "API rewrite project started April 2026",
    "content": "Started API rewrite using FastAPI...\n\n**Why:**..."
  },
  {
    "operation": "UPDATE",
    "file": "user_preferences.md",
    "reason": "User corrected testing approach",
    "new_content": "Updated content here..."
  },
  {
    "operation": "DELETE",
    "file": "project_old_migration.md",
    "reason": "Migration completed, no longer relevant"
  },
  {
    "operation": "NOOP",
    "reason": "Conversation was routine, no new knowledge"
  }
]
## RULES
1. Only CREATE if the information will help in FUTURE conversations
2. Only UPDATE if something actually changed (not just was mentioned)
3. Only DELETE if a fact is confirmed wrong or permanently irrelevant
4. Most conversations should result in NOOP — not everything is memorable
5. Never store: code patterns, git history, current task progress,
secrets/credentials
6. Convert relative dates to absolute (e.g., "next Thursday" → "2026-04-10")
7. For CREATE/UPDATE: include WHY (reasoning) not just WHAT (the fact)
8. Check for duplicates before CREATE — prefer UPDATE over CREATE
9. Keep memory files focused — one topic per file
10. Update the MEMORY.md index for any CREATE or DELETE
The Execution Layer
After the sweep LLM returns operations, a script applies them:
import os

def apply_memory_operations(operations, memory_dir):
    # update_index / remove_from_index are assumed helpers that keep
    # the MEMORY.md index in sync with the files on disk.
    for op in operations:
        if op["operation"] == "NOOP":
            continue
        elif op["operation"] == "CREATE":
            file_path = os.path.join(memory_dir, op["file"])
            content = f"""---
name: {op["name"]}
description: {op["description"]}
type: {op["type"]}
---

{op["content"]}
"""
            with open(file_path, "w") as f:
                f.write(content)
            update_index(memory_dir, op["file"], op["name"], op["description"])
        elif op["operation"] == "UPDATE":
            file_path = os.path.join(memory_dir, op["file"])
            with open(file_path, "r") as f:
                existing = f.read()
            # Preserve frontmatter, replace everything after it
            frontmatter_end = existing.index("---", 3) + 3
            frontmatter = existing[:frontmatter_end]
            with open(file_path, "w") as f:
                f.write(frontmatter + "\n\n" + op["new_content"])
        elif op["operation"] == "DELETE":
            file_path = os.path.join(memory_dir, op["file"])
            os.remove(file_path)
            remove_from_index(memory_dir, op["file"])
Triggering the Sweep
This can be a hook that fires when a conversation ends:
#!/bin/bash
# .claude/hooks/post-conversation.sh
# Export the conversation transcript
TRANSCRIPT=$(cat "$CONVERSATION_FILE")
# Run the sweep with a cheap, fast model
python3 memory_sweep.py \
  --memory-dir ~/.claude/projects/my-project/memory/ \
  --transcript "$TRANSCRIPT" \
  --model "claude-haiku-4-5-20251001"
Cost: approximately $0.01-0.02 per sweep with Haiku. For 20 conversations a day, that's $0.20-0.40/day.
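The testable core of a memory_sweep.py like the one invoked above is prompt assembly and response parsing. A sketch of those two pieces; the model call itself is shown only as a comment, using the Anthropic Python SDK's Messages API as one possible transport (verify the exact usage against the current SDK docs):

```python
import json
import re

def build_sweep_prompt(index: str, memory_files: str, transcript: str) -> str:
    """Assemble the curator prompt from the three inputs the sweep needs."""
    return (
        "You are a memory curator for an AI assistant.\n\n"
        f"## EXISTING MEMORY INDEX\n{index}\n\n"
        f"## EXISTING MEMORY FILES\n{memory_files}\n\n"
        f"## CONVERSATION TRANSCRIPT\n{transcript}\n\n"
        "## YOUR TASK\nOutput a JSON array of operations."
    )

def parse_operations(reply: str) -> list[dict]:
    """Extract the JSON array from the model reply, tolerating code fences."""
    match = re.search(r"\[.*\]", reply, re.DOTALL)
    if not match:
        return [{"operation": "NOOP", "reason": "no operations found"}]
    return json.loads(match.group(0))

# One possible transport (check against current Anthropic SDK docs):
# client = anthropic.Anthropic()
# reply = client.messages.create(
#     model="claude-haiku-4-5-20251001", max_tokens=4096,
#     messages=[{"role": "user", "content": prompt}],
# ).content[0].text
# apply_memory_operations(parse_operations(reply), memory_dir)
```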
The Context Tree: How Retrieval Scales
As your memory grows past 20-30 files, you can't load everything into context. You need a retrieval strategy.
Flat Loading (Simple, Doesn't Scale)
Start conversation
→ Load ALL memory files into context
→ 50 files × 200 tokens average = 10,000 tokens
→ 100 files = 20,000 tokens (getting expensive)
→ 500 files = 100,000 tokens (unusable)
Context Tree (Hierarchical Retrieval)
Organize memories into a tree and only load what's needed:
memory/
├── index.md ← Always loaded (~200 tokens)
│
├── core/ ← Always loaded (~500 tokens)
│ ├── user.md (identity, role, preferences)
│ └── active_projects.md (what's currently in progress)
│
├── projects/ ← Loaded on demand
│ ├── index.md (one-line summary per project)
│ ├── project-alpha/
│ │ ├── overview.md
│ │ ├── decisions.md
│ │ └── architecture.md
│ └── blog/
│ ├── overview.md
│ └── analytics.md
│
├── feedback/ ← Loaded on demand
│ ├── index.md (one-line summary per rule)
│ ├── testing.md
│ └── code_style.md
│
└── archive/ ← Rarely loaded
├── completed_projects/
└── outdated_decisions/
The Retrieval Algorithm
Step 1: Always load core/ (~500 tokens)
"User is Alex, software engineer, prefers Docker..."
Step 2: Read the user's message
"How should I deploy project alpha?"
Step 3: LLM decides which branches to load
→ "This is about projects/project-alpha/"
→ Load projects/project-alpha/overview.md
→ Load projects/project-alpha/decisions.md
Step 4: Total context loaded: ~1,500 tokens
(instead of 10,000+ for loading everything)
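The four steps above can be sketched as a two-stage loader. Step 3's branch selection is passed in as a plain list here (in a real system an LLM call produces it from the user's message); the directory layout assumed is the tree from the previous section:

```python
from pathlib import Path

def load_context(memory_dir: Path, selected_branches: list[str]) -> str:
    """Load index + core unconditionally, then only the selected branches."""
    parts = []
    # Step 1: the index file and core/ are always in context.
    for always in ["index.md", "core"]:
        path = memory_dir / always
        if path.is_file():
            parts.append(path.read_text())
        elif path.is_dir():
            parts += [p.read_text() for p in sorted(path.glob("*.md"))]
    # Step 3: branches the LLM judged relevant to the user's message.
    for branch in selected_branches:
        parts += [p.read_text()
                  for p in sorted((memory_dir / branch).rglob("*.md"))]
    return "\n\n".join(parts)

# Usage once the LLM has picked branches for "How should I deploy project alpha?":
# context = load_context(Path("memory"), ["projects/project-alpha"])
```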
How Is This Different from a B-Tree?
A B-tree is a data structure optimized for sorted key lookups in databases. It maintains balance (all leaves at the same depth) and uses comparison operators (less than, greater than, equal) to navigate.
A context tree is fundamentally different:
| | B-Tree | Context Tree |
|---|---|---|
| Navigation | Compare sort keys mechanically | LLM judges semantic relevance |
| Balance | Self-balancing (guaranteed O(log n)) | Human-organized (can be unbalanced) |
| Query | Exact match or range (WHERE id > 50) | Natural language ("anything about deployment?") |
| Node decision | Binary: go left or go right | Multi-way: load this branch, skip that one, maybe check two |
| Maintenance | Automatic (rebalance on insert) | Manual or LLM-assisted reorganization |
The context tree is closer to a library catalog than a database index. The index tells you which shelf to check, and your judgment (or the LLM's) determines what to pull off the shelf.
When to Restructure the Tree
Just like a library occasionally reorganizes its sections, your context tree needs periodic restructuring:
- A branch has too many files (>15): Split into sub-branches
- Two branches overlap heavily: Merge them
- A branch is never accessed: Archive it
- The index file exceeds 200 lines: Summarize and consolidate
This restructuring can be part of the post-conversation sweep (run weekly or when file count crosses a threshold).
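The triggers above are easy to check mechanically before handing the actual reorganization to an LLM. A sketch, with thresholds mirroring the list (everything else is illustrative):

```python
from pathlib import Path

def restructure_report(memory_dir: Path, max_files: int = 15,
                       max_index_lines: int = 200) -> list[str]:
    """Flag branches and index files that have outgrown the thresholds."""
    findings = []
    for branch in sorted(p for p in memory_dir.iterdir() if p.is_dir()):
        files = list(branch.glob("*.md"))
        if len(files) > max_files:
            findings.append(
                f"{branch.name}: {len(files)} files, consider splitting")
        index = branch / "index.md"
        if (index.exists()
                and len(index.read_text().splitlines()) > max_index_lines):
            findings.append(
                f"{branch.name}/index.md: too long, summarize and consolidate")
    return findings
```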
The Unsolved Problems
AI memory management is a young field. Here are the problems nobody has fully solved:
1. Conflict Resolution
Two conversations produce contradicting memories:
- Monday: "User wants microservices architecture"
- Wednesday: "User wants monolith for simplicity"
Which one wins? Options:
- Last-write-wins (simple but loses context)
- Timestamp + reasoning (keep both with dates, let LLM judge which is current)
- Ask the user (safest but annoying)
No standard exists. Most systems use last-write-wins, which silently loses the reasoning behind the first decision.
2. Memory Scoring and Prioritization
When context is limited, which memories get loaded first? A memory accessed 50 times is probably more important than one accessed twice. But a fresh memory about a critical deadline might matter more than a frequently-accessed preference.
Possible scoring:
score = (access_count × 0.3) + (recency × 0.4) + (type_weight × 0.3)
type_weights:
user_preference: 0.9
active_project: 0.8
feedback: 0.7
reference: 0.5
archived: 0.1
Nobody has validated what weights actually work. It's all intuition right now.
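As a sketch, the formula translates directly once you pick a normalization for the inputs. Here I assume access counts cap at 50 reads and recency decays linearly over 90 days (both arbitrary choices, like the weights themselves):

```python
TYPE_WEIGHTS = {
    "user_preference": 0.9,
    "active_project": 0.8,
    "feedback": 0.7,
    "reference": 0.5,
    "archived": 0.1,
}

def memory_score(access_count: int, days_since_access: int,
                 mem_type: str) -> float:
    """score = access*0.3 + recency*0.4 + type_weight*0.3, inputs in [0, 1]."""
    access = min(access_count / 50, 1.0)               # 50+ reads = max signal
    recency = max(0.0, 1.0 - days_since_access / 90)   # linear decay, 90 days
    return access * 0.3 + recency * 0.4 + TYPE_WEIGHTS.get(mem_type, 0.5) * 0.3
```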
3. Cross-Session Deduplication
Five conversations all note "user prefers Docker." Should you:
- Store it once (efficient but fragile — one bad delete loses it)
- Let frequency reinforce confidence (robust but bloated)
- Score by confirmation count (balanced but complex)
4. Privacy and Forgetting
Users need to be able to say "forget everything about project X" and have it actually happen. This requires:
- Reliable discovery of all related memories (not just exact keyword matches)
- Cascade deletion (removing memories that reference the deleted ones)
- Verification that deletion was complete
This is the "right to be forgotten" problem, and it's surprisingly hard when memories are unstructured text.
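A naive cascade can at least be sketched: delete direct matches, then repeatedly remove anything that references a deleted file. This only catches explicit filename references; the semantic discovery in the first bullet is the genuinely unsolved part:

```python
from pathlib import Path

def cascade_forget(memory_dir: Path, topic: str) -> list[str]:
    """Delete memories matching `topic`, then any referencing deleted files."""
    deleted = []
    # Pass 1: direct matches by filename or content keyword.
    for path in list(memory_dir.rglob("*.md")):
        if (topic.lower() in path.name.lower()
                or topic.lower() in path.read_text().lower()):
            path.unlink()
            deleted.append(path.name)
    # Pass 2+: cascade until no surviving file references a deleted one.
    changed = True
    while changed:
        changed = False
        for path in list(memory_dir.rglob("*.md")):
            if any(name in path.read_text() for name in deleted):
                path.unlink()
                deleted.append(path.name)
                changed = True
    return deleted
```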
5. Multi-Agent Memory Sharing
When multiple AI agents work in the same project (a coding agent, a review agent, a deploy agent), how do they share memory?
- Shared memory folder? (conflict risk)
- Per-agent folders with a shared core? (duplication)
- Event-based sync? (complexity)
This is an active research area with no consensus.
Practical Starting Points
If you want to implement AI memory today, here's the progression:
Level 1: Manual Markdown (10 minutes to set up)
Create a memory/ folder with an index.md. Write memories manually. Load them into your AI tool's context.
Level 2: LLM Self-Curation (already built into some tools)
Use a tool like Claude Code that writes memory files during conversations. Review and clean up periodically.
Level 3: Post-Conversation Sweep (a few hours to build)
Add a hook that runs after each conversation. Use a cheap model to extract and manage memories automatically.
Level 4: Full Context Tree with Decay (a weekend project)
Organize into hierarchical folders. Add access tracking. Implement decay logic. Run weekly restructuring.
Level 5: Hybrid with Vector Search (when you actually need it)
Add a vector database for large archives. Keep markdown for curated core knowledge. Route queries to the right layer.
Most people will get 90% of the value from Level 2 or 3. Don't jump to Level 5 because it sounds impressive — you'll spend more time maintaining infrastructure than benefiting from memory.
What's Next
The tools will get better. Memory management will become a standard feature, not a DIY project. But the core principles won't change:
- LLMs are stateless — memory is always external
- Curation matters more than storage — knowing what to forget is as important as knowing what to remember
- Start simple — a folder of markdown files beats an over-engineered knowledge graph
- Curation needs the conversation's context — whatever writes your memories must see the full transcript, or the extracted facts lose their meaning
The question isn't whether your AI needs memory. It's whether you'll design it intentionally, or let it grow into an unmanageable mess.
Choose to design it.
This is Part 2 of a series on AI memory. Read Part 1: LLMs Don't Remember Anything for the fundamentals. Find me on LinkedIn for more practical AI engineering content.