How AI Agents Do Deep Search — Building a Research Agent from Scratch
Why a single LLM call fails for complex research questions, and how to build an agent that iteratively searches, reads, reasons, and synthesizes — with full Python code you can run.
In the previous posts, we built agents that call tools and verify their own actions. Those agents work great when the task is clear: update a record, query a database, send an email. But what happens when the user asks something like:
"What are the trade-offs between ReAct and Plan-and-Execute agent architectures for production systems?"
No single tool call answers that. No single web search returns a clean result. The agent needs to search, read, think, search again — potentially across multiple sources — and then synthesize everything into a coherent answer.
This is deep search: an agent that researches a topic iteratively, the same way you would. This post shows how to build one from scratch.
Why a Single LLM Call Isn't Enough
Let's start with why the naive approach fails. If you just ask an LLM:
User: "What are the latest developments in AI agent memory systems?"
You get a response based on the model's training data — which has a cutoff date. It can't tell you what was published last week. It can't cite specific papers. It might confidently describe something that doesn't exist.
OK, so give it a search tool:
User: "What are the latest developments in AI agent memory systems?"
Model calls: web_search("latest developments AI agent memory systems")
→ Returns 5 snippets
Model: "Here's what I found..." (summarizes the 5 snippets)
Better, but still shallow. The snippets are short. The model summarizes surface-level results without understanding any source deeply. It can't follow up on promising leads or cross-reference between sources.
Deep search is what happens when the agent doesn't stop at one search. It reads results, identifies gaps, formulates new queries, reads more, and builds understanding iteratively — until it has enough to give a thorough answer.
The Deep Search Loop
Here's the pattern:
┌─────────────────────────────────────────────────────┐
│ User Question │
└──────────────────────┬──────────────────────────────┘
│
▼
┌─────────────────┐
│ Decompose into │
│ sub-queries │
└────────┬────────┘
│
┌────────────▼────────────┐
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Search │ │ Search │
│ sub-query │ │ sub-query │
│ #1 │ │ #2 │
└──────┬──────┘ └──────┬──────┘
│ │
▼ ▼
┌─────────────┐ ┌─────────────┐
│ Read page │ │ Read page │
│ Extract │ │ Extract │
│ key points │ │ key points │
└──────┬──────┘ └──────┬──────┘
│ │
└────────────┬────────────┘
│
┌────────▼────────┐
│ Gaps found? │──── Yes ──→ New search queries
│ Contradictions?│ │
└────────┬────────┘ │
│ No │
▼ ◄─────────────┘
┌─────────────────┐
│ Synthesize │
│ final answer │
│ with citations │
└─────────────────┘
The agent decides at each step: do I know enough, or do I need to keep searching? This is the same ReAct loop from earlier posts, but the tools are oriented toward information gathering instead of data mutation.
The Tools
A deep search agent needs three types of tools:
| Tool | Purpose | When the Agent Uses It |
|---|---|---|
| `web_search` | Find relevant URLs and snippets | Starting a new sub-query or following a lead |
| `read_page` | Extract full content from a URL | When a search snippet looks promising but needs deeper reading |
| `save_note` | Store a finding with source attribution | After extracting useful information from a page |
Let's build each one.
Tool 1: Web Search
import httpx
import os
import json
async def web_search(query: str, num_results: int = 5) -> str:
"""Search the web using Tavily API and return results."""
async with httpx.AsyncClient(timeout=30) as client:
response = await client.post(
"https://api.tavily.com/search",
json={
"api_key": os.getenv("TAVILY_API_KEY"),
"query": query,
"max_results": num_results,
"include_raw_content": False,
},
)
data = response.json()
results = []
for r in data.get("results", []):
results.append({
"title": r["title"],
"url": r["url"],
"snippet": r["content"][:500],
})
return json.dumps({
"query": query,
"results": results,
"count": len(results),
})
Nothing fancy — we call a search API and return titles, URLs, and snippets. The model reads these snippets and decides which pages are worth reading in full.
Why Tavily? It's built for AI agents — returns clean content instead of ad-heavy HTML. You could also use SerpAPI, Brave Search, or DuckDuckGo's API. The agent pattern is the same regardless of which search provider you use.
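Since the agent's contract is just "query in, JSON results out", it can help to hide the provider behind a small interface so you can swap search backends or exercise the loop offline. A minimal sketch; the `SearchProvider` protocol and `FakeProvider` are illustrative names, not from any provider's SDK:

```python
import asyncio
import json
from typing import Protocol


class SearchProvider(Protocol):
    """Anything that can turn a query into (title, url, snippet) results."""

    async def search(self, query: str, num_results: int) -> list[dict]: ...


async def web_search_with(
    provider: SearchProvider, query: str, num_results: int = 5
) -> str:
    """Provider-agnostic web_search: same JSON contract the agent expects."""
    results = await provider.search(query, num_results)
    return json.dumps({"query": query, "results": results, "count": len(results)})


class FakeProvider:
    """Stub provider for testing the agent loop without network access."""

    async def search(self, query: str, num_results: int) -> list[dict]:
        hits = [{"title": "Stub result", "url": "https://example.com", "snippet": query}]
        return hits[:num_results]
```

With this shape, switching from Tavily to Brave or DuckDuckGo means writing one adapter class; nothing in the agent loop or the tool schema changes.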
Tool 2: Read Page
async def read_page(url: str) -> str:
"""Fetch and extract the main text content from a URL."""
    async with httpx.AsyncClient(timeout=30, follow_redirects=True) as client:
        response = await client.post(
            "https://api.tavily.com/extract",
            json={
                "api_key": os.getenv("TAVILY_API_KEY"),
                "urls": url,
            },
        )
data = response.json()
results = data.get("results", [])
if not results:
return json.dumps({"error": "Could not extract content", "url": url})
content = results[0].get("raw_content", "")
# Trim to stay within token budget
if len(content) > 8000:
content = content[:8000] + "\n\n[...content truncated]"
return json.dumps({
"url": url,
"content": content,
})
This is where deep search gets its depth. Instead of relying on 200-character snippets, the agent reads full pages. The 8,000-character limit keeps us within the token budget — the same pattern of trimming large tool responses we used in the earlier posts.
Tool 3: Save Note (Agent's Scratchpad)
class ResearchNotes:
"""In-memory scratchpad for the agent to accumulate findings."""
def __init__(self):
self.notes: list[dict] = []
def save(self, finding: str, source_url: str) -> str:
note = {
"id": len(self.notes) + 1,
"finding": finding,
"source": source_url,
}
self.notes.append(note)
return json.dumps({
"saved": True,
"note_id": note["id"],
"total_notes": len(self.notes),
})
def get_all(self) -> str:
return json.dumps({"notes": self.notes, "count": len(self.notes)})
Why does the agent need a scratchpad? Because deep search involves multiple search-read cycles. Without notes, the agent would need to keep all findings in the conversation history, blowing up the token count. The scratchpad lets it store key points and discard the full page content.
Think of it like a researcher reading papers — you don't memorize every paper. You take notes on the important parts and refer back to your notes when writing.
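The in-memory scratchpad disappears when the process exits. If a research session should survive restarts, the same interface can be backed by a file. A hypothetical file-backed variant; the `PersistentNotes` name and the JSON-lines format are assumptions, not from this post's code:

```python
import json
from pathlib import Path


class PersistentNotes:
    """Scratchpad that appends each finding to a JSON-lines file."""

    def __init__(self, path: str = "research_notes.jsonl"):
        self.path = Path(path)

    def save(self, finding: str, source_url: str) -> str:
        note = {"id": len(self._load()) + 1, "finding": finding, "source": source_url}
        with self.path.open("a") as f:
            f.write(json.dumps(note) + "\n")
        return json.dumps({"saved": True, "note_id": note["id"], "total_notes": note["id"]})

    def get_all(self) -> str:
        notes = self._load()
        return json.dumps({"notes": notes, "count": len(notes)})

    def _load(self) -> list[dict]:
        if not self.path.exists():
            return []
        return [json.loads(line) for line in self.path.read_text().splitlines() if line]
```

Because `save` and `get_all` keep the same signatures and return the same JSON shapes, it drops into the tool registry without touching the schemas or the loop.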
Tool Schemas
Following the same pattern from Part 1, we define JSON schemas so the LLM knows what's available:
tools = [
{
"type": "function",
"function": {
"name": "web_search",
"description": (
"Search the web for information. Returns titles, URLs, and "
"short snippets. Use this to find relevant sources, then use "
"read_page to get full content from promising results."
),
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "The search query. Be specific."
},
"num_results": {
"type": "integer",
"description": "Number of results (default 5, max 10)"
}
},
"required": ["query"]
}
}
},
{
"type": "function",
"function": {
"name": "read_page",
"description": (
"Read the full text content of a web page. Use this when a "
"search snippet looks relevant but you need more detail. "
"Returns the extracted main content (up to 8000 chars)."
),
"parameters": {
"type": "object",
"properties": {
"url": {
"type": "string",
"description": "The URL to read"
}
},
"required": ["url"]
}
}
},
{
"type": "function",
"function": {
"name": "save_note",
"description": (
"Save a research finding to your notes. Use this after reading "
"a page to record the key insight. Always include the source URL. "
"Your notes will be used to compose the final answer."
),
"parameters": {
"type": "object",
"properties": {
"finding": {
"type": "string",
"description": "The key finding or insight to save"
},
"source_url": {
"type": "string",
"description": "URL where this information was found"
}
},
"required": ["finding", "source_url"]
}
}
},
{
"type": "function",
"function": {
"name": "get_notes",
"description": (
"Retrieve all saved research notes. Use this before writing "
"your final answer to review everything you've found."
),
"parameters": {
"type": "object",
"properties": {}
}
}
},
]
Notice how the descriptions guide behavior: "Use this after reading a page", "Use this before writing your final answer". These aren't just documentation — they're instructions that shape the agent's workflow.
The System Prompt: Teaching the Agent to Research
This is where the deep search behavior comes from. The system prompt teaches the agent how to research, not just what tools to use:
SYSTEM_PROMPT = """You are a research agent. Your job is to thoroughly
research a topic and provide a comprehensive, well-sourced answer.
## Research Process
1. DECOMPOSE: Break the user's question into 2-4 specific sub-questions
that, if answered, would fully address the original question.
2. SEARCH: For each sub-question, search the web with a focused query.
Don't search the user's exact question — rephrase it into effective
search queries.
3. READ: When a search result looks promising, read the full page.
Don't rely only on snippets — they often lack context.
4. NOTE: After reading each useful page, save a note with the key
finding and the source URL. Be specific — include numbers, dates,
names, and direct quotes when relevant.
5. ITERATE: After your first round of searches, review your notes.
Are there gaps? Contradictions? Unanswered sub-questions?
If yes, search again with refined queries.
6. SYNTHESIZE: Once you have enough information (at least 3-5 sources),
retrieve all your notes and write a comprehensive answer.
## Rules
- ALWAYS read at least 2-3 full pages before answering. Snippets alone
are not enough for a thorough answer.
- ALWAYS save notes with source URLs. Your final answer must cite sources.
- If search results are poor, try different query phrasings.
- If sources contradict each other, note the disagreement and explain both
perspectives.
- Aim for depth over breadth. It's better to deeply understand 3 sources
than to skim 10.
- Your final answer should be well-structured with sections, and include
a "Sources" list at the end.
"""
This prompt encodes the research methodology. The agent doesn't need hardcoded logic for "search twice then synthesize" — it follows the process described in its prompt, using the ReAct reasoning pattern to decide when it has enough information.
The Agent Loop
The loop itself is identical to what we built in Part 1. The intelligence comes from the tools and the system prompt, not from the loop:
import asyncio

from openai import AsyncOpenAI
client = AsyncOpenAI(
base_url="https://openrouter.ai/api/v1",
api_key=os.getenv("OPENROUTER_API_KEY"),
)
async def deep_search(question: str, max_iterations: int = 15) -> str:
"""Run the deep search agent on a question."""
notes = ResearchNotes()
    # Tool dispatch registry: reference the coroutine functions directly
    # so the loop can detect and await them
    tool_functions = {
        "web_search": web_search,
        "read_page": read_page,
        "save_note": notes.save,
        "get_notes": lambda: notes.get_all(),
    }
messages = [
{"role": "system", "content": SYSTEM_PROMPT},
{"role": "user", "content": question},
]
for i in range(max_iterations):
response = await client.chat.completions.create(
model="google/gemini-2.0-flash-001",
messages=messages,
tools=tools,
)
message = response.choices[0].message
# No tool calls → agent is done researching
if not message.tool_calls:
return message.content
messages.append(message)
# Execute each tool call
for tc in message.tool_calls:
func_name = tc.function.name
args = json.loads(tc.function.arguments or "{}")
func = tool_functions.get(func_name)
            if func:
                result = func(**args)
                # Async tools (web_search, read_page) return coroutines
                # that must be awaited; sync tools return strings directly
                if asyncio.iscoroutine(result):
                    result = await result
            else:
                result = json.dumps({"error": f"Unknown tool: {func_name}"})
messages.append({
"role": "tool",
"tool_call_id": tc.id,
"content": result,
})
print(f" [{i+1}] {func_name}({list(args.keys())})")
return "Research reached maximum iterations. Partial results may be in notes."
The max_iterations is set to 15 — higher than a typical agent — because deep search involves more steps: multiple searches, page reads, and note-taking before the final synthesis.
The print line gives you visibility into what the agent is doing. A typical research session looks like:
[1] web_search(['query'])
[1] web_search(['query'])
[2] read_page(['url'])
[2] read_page(['url'])
[3] save_note(['finding', 'source_url'])
[3] save_note(['finding', 'source_url'])
[4] web_search(['query'])
[5] read_page(['url'])
[6] save_note(['finding', 'source_url'])
[7] get_notes([])
Seven iterations, three searches, three pages read, three notes saved, one final retrieval before synthesis. That's the depth that a single search-and-summarize can't match.
Running It
import asyncio
async def main():
question = (
"What are the main approaches to giving AI agents long-term memory, "
"and what are the trade-offs of each approach?"
)
print(f"Researching: {question}\n")
answer = await deep_search(question)
print("\n" + "=" * 60)
print(answer)
asyncio.run(main())
What the Agent Actually Does
Let's trace through a real execution. The user asks about AI agent memory systems. Here's what happens:
Iteration 1 — Decompose and search:
Thought: I need to break this into sub-questions:
1. What memory architectures exist for AI agents?
2. How do vector databases compare to other memory approaches?
3. What are real-world implementations (MemGPT, LangChain, etc.)?
Action: web_search("AI agent memory architectures survey 2025")
Action: web_search("MemGPT long-term memory LLM agents")
The agent decomposes the question and fires off two searches in a single turn — models that support parallel tool calls can emit several calls in one response, and the loop executes each of them.
Iteration 2 — Read promising results:
Thought: The first result from Lilian Weng's blog looks comprehensive.
The MemGPT paper link looks relevant too.
Action: read_page("https://lilianweng.github.io/posts/...")
Action: read_page("https://arxiv.org/abs/...")
Instead of relying on 200-character snippets, the agent reads full pages. This is where depth comes from.
Iteration 3 — Save findings:
Thought: The Weng blog covers three memory types: sensory, short-term,
and long-term. The MemGPT paper introduces virtual context management.
Action: save_note(
finding="Three memory types: (1) Sensory - raw input buffer, (2) Short-term
- conversation window/context, (3) Long-term - vector DB retrieval. Weng
categorizes long-term into episodic (past experiences) and semantic (facts).",
source_url="https://lilianweng.github.io/posts/..."
)
Action: save_note(
finding="MemGPT uses OS-inspired virtual memory: pages data between fast
context window and slow external storage. Self-directed memory management
via function calling. Tested on document QA and multi-session chat.",
source_url="https://arxiv.org/abs/..."
)
The notes are specific — names, categories, mechanisms. Not vague summaries.
Iteration 4 — Identify gaps and search again:
Thought: I have theory (Weng) and one implementation (MemGPT), but I'm
missing practical trade-offs. Let me search for comparisons.
Action: web_search("vector database vs knowledge graph AI agent memory comparison")
This is the key moment — the agent evaluates its own knowledge and decides it needs more. A single-search agent would have stopped at iteration 1.
Iteration 5-6 — Read more, save more notes
Iteration 7 — Synthesize:
Thought: I have 5 notes from 4 sources covering theory, implementations,
and trade-offs. Let me review everything and write the answer.
Action: get_notes()
→ Returns all 5 saved findings with sources
Thought: I have enough to write a comprehensive answer with citations.
Final Answer: [structured response with sections and source list]
Query Decomposition: The Key to Better Search
The biggest difference between shallow and deep search is how the agent formulates queries. A naive agent searches the user's exact question. A deep search agent decomposes it.
User question:
"How do modern AI coding assistants handle context windows larger than their training limit?"
Naive search:
web_search("modern AI coding assistants context windows larger than training limit")
Decomposed searches:
web_search("long context window techniques LLM 2025 RAG chunking")
web_search("Cursor Copilot context management architecture")
web_search("retrieval augmented generation code assistants implementation")
The decomposed queries are more specific, use domain terminology, and target different aspects of the question. This is taught through the system prompt — the agent learns to "break the question into 2-4 specific sub-questions."
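If you want decomposition as an explicit step rather than leaving it to the system prompt, you can ask the model for sub-queries as a JSON array and parse the reply defensively. A sketch of the parsing side only; the fence-stripping logic reflects a common model habit of wrapping JSON in markdown, not a guarantee:

```python
import json


def parse_subqueries(raw: str, max_queries: int = 4) -> list[str]:
    """Parse a model reply expected to contain a JSON array of search queries.

    Strips markdown code fences first, returns [] on unparseable input so the
    caller can fall back to searching the original question.
    """
    text = raw.strip()
    if text.startswith("```"):
        # Take the content between the first pair of fences,
        # dropping an optional "json" language tag
        text = raw.split("```")[1]
        if text.startswith("json"):
            text = text[len("json"):]
    try:
        queries = json.loads(text)
    except json.JSONDecodeError:
        return []
    return [q for q in queries if isinstance(q, str)][:max_queries]
```

The cap at four queries mirrors the "2-4 specific sub-questions" guidance in the system prompt.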
Token Budget Management
Deep search is expensive in tokens. Each page read can be 2000-8000 tokens. After several iterations, the conversation history grows fast. Here are the techniques to manage it:
1. Trim Page Content
We already do this in read_page — capping at 8000 characters. But you can be smarter:
async def read_page(url: str) -> str:
# ... fetch content ...
# Aggressive trimming for very long pages
if len(content) > 8000:
# Keep first 4000 (usually intro + key points) and last 2000 (conclusion)
content = (
content[:4000]
+ "\n\n[...middle section omitted...]\n\n"
+ content[-2000:]
)
return json.dumps({"url": url, "content": content})
2. Notes Replace Full Content
This is why the save_note tool exists. Once the agent saves a note from a page, it doesn't need the full page content anymore. The note (50-200 tokens) replaces the page (2000-8000 tokens) in the agent's working memory.
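The same compression can be enforced mechanically: once findings are saved, older bulky tool results can be replaced in the conversation history with a short stub. A sketch of such a pruning pass; the agent loop in this post does not do this, so treat it as an optional extension:

```python
import json


def prune_tool_results(messages: list[dict], max_chars: int = 1000) -> list[dict]:
    """Replace bulky tool results (e.g. full page reads) with a short stub.

    Keeps the most recent tool message intact so the model can still act on it.
    """
    last_tool_idx = max(
        (i for i, m in enumerate(messages) if m.get("role") == "tool"), default=-1
    )
    pruned = []
    for i, m in enumerate(messages):
        if (
            m.get("role") == "tool"
            and i != last_tool_idx
            and len(m.get("content", "")) > max_chars
        ):
            # Copy rather than mutate, so the original history is preserved
            m = {**m, "content": json.dumps({"pruned": True, "hint": "see saved notes"})}
        pruned.append(m)
    return pruned
```

Running this between iterations keeps the history roughly constant-size: full pages live in context only for the turn in which the agent reads them.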
3. Set Iteration Limits Per Phase
SYSTEM_PROMPT += """
## Budget
- Maximum 3 search queries per sub-question
- Maximum 5 pages read total
- Save notes as you go — don't try to remember everything
- After reading 5 pages, stop searching and synthesize from your notes
"""
This prevents the agent from spiraling into endless research. Real researchers have deadlines too.
Adding Source Verification
Building on the verification pattern from Part 2, we can add a verification step to deep search. When the agent finds a critical claim, it can cross-reference it:
{
"type": "function",
"function": {
"name": "verify_claim",
"description": (
"Search for a second source to verify a specific claim. "
"Use this when you find an important fact that you want to "
"confirm before including in your final answer."
),
"parameters": {
"type": "object",
"properties": {
"claim": {
"type": "string",
"description": "The specific claim to verify"
},
"original_source": {
"type": "string",
"description": "URL of the original source"
}
},
"required": ["claim", "original_source"]
}
}
}
from urllib.parse import urlparse

async def verify_claim(claim: str, original_source: str) -> str:
    """Search for a second source to confirm or deny a claim."""
    # Search for the claim, excluding the original domain
    search_result = await web_search(
        query=claim,
        num_results=3,
    )
    results = json.loads(search_result)["results"]
    # Filter out results from the same domain; urlparse handles URLs
    # more robustly than splitting the string by hand
    original_domain = urlparse(original_source).netloc
other_sources = [
r for r in results
if original_domain not in r["url"]
]
return json.dumps({
"claim": claim,
"original_source": original_source,
"other_sources": other_sources,
"verified": len(other_sources) > 0,
})
Now the agent can say: "According to Source A, X is true. This is corroborated by Source B." — or flag when a claim appears in only one source.
The Architecture at a Glance
┌─────────────────────────────────────────────────────────┐
│ DEEP SEARCH AGENT │
├─────────────────────────────────────────────────────────┤
│ │
│ System Prompt │
│ ┌─────────────────────────────────────────────────┐ │
│ │ Research methodology: decompose → search → │ │
│ │ read → note → iterate → synthesize │ │
│ └─────────────────────────────────────────────────┘ │
│ │
│ Tools Scratchpad │
│ ┌──────────────┐ ┌──────────────────┐ │
│ │ web_search │ │ Note 1: "..." │ │
│ │ read_page │──────────────│ Note 2: "..." │ │
│ │ save_note │ saves to │ Note 3: "..." │ │
│ │ get_notes │──────────────│ ... │ │
│ │ verify_claim │ └──────────────────┘ │
│ └──────────────┘ │
│ │
│ Agent Loop (ReAct) │
│ ┌─────────────────────────────────────────────────┐ │
│ │ for i in range(max_iterations): │ │
│ │ response = llm(messages + tools) │ │
│ │ if no tool_calls: return final answer │ │
│ │ execute tools, append results │ │
│ │ loop back │ │
│ └─────────────────────────────────────────────────┘ │
│ │
└─────────────────────────────────────────────────────────┘
The agent loop is the same while loop from every previous post. The magic is in the combination of research-oriented tools + a system prompt that teaches methodology + a scratchpad that accumulates knowledge across iterations.
Key Takeaways
- Deep search is iterative, not single-shot. The agent searches, reads, takes notes, identifies gaps, and searches again. This loop — not a better prompt — is what produces thorough answers.
- The scratchpad is essential. Without it, the agent must keep all page content in conversation history, blowing the token budget. Notes compress 8000 characters of page content into 200 characters of key findings.
- Query decomposition beats direct search. Breaking "What are the trade-offs of X?" into specific sub-questions produces better search results than searching the original question verbatim.
- The system prompt encodes research methodology. You're not just listing tools — you're teaching the agent how to research: decompose, search, read deeply, take notes, identify gaps, iterate, synthesize with citations.
- Token budget management is a first-class concern. Trim page content, use notes instead of full text, and set iteration limits. Deep search without budget controls will exhaust your context window.
- Verification builds trust. Cross-referencing claims across sources — the same pattern from the verification post — ensures your agent doesn't amplify misinformation from a single source.
This is Part 4 of a series on AI agents. Part 1 covers function calling, Part 2 builds agents from scratch, and Part 3 adds verification loops.