AI Agents Crash Course
Learn how to build AI agents from scratch, starting with data ingestion and indexing from GitHub repositories
Introduction to AI Agents
Welcome to the AI Agents Crash Course! In this series, you'll learn how to build intelligent AI agents step by step.
What You'll Learn
This course covers the complete journey of building AI agents:
- Day 1: Ingest and index data from GitHub repositories
- Day 2: Chunking and intelligent processing
- Day 3: Search - Text, Vector, and Hybrid
- Day 4: Agents and Tools - Building the conversational agent
- Day 5: Evaluation - Testing and measuring agent quality
Prerequisites
- Basic Python knowledge
- Docker installed on your machine
- Familiarity with command line
Credits
This course is inspired by and based on the AI Hero program by Alexey Grigorev.
Code Repository
Follow along with the complete code at: GitHub Repository
Let's get started!
---PAGE---
Day 1: Ingest and Index Your Data
The first step in building any AI agent is gathering and processing data. Today we'll learn how to:
- Download repositories as zip archives
- Parse frontmatter metadata
- Extract content from markdown files
What is Frontmatter?
Frontmatter is a format used in markdown files where YAML metadata is placed at the top of the file between `---` markers. This is very useful because we can extract structured information (like title, tags, difficulty level) along with the content.
---
title: "Getting Started with AI"
author: "John Doe"
date: "2024-01-15"
tags: ["ai", "machine-learning", "tutorial"]
difficulty: "beginner"
---
# Getting Started with AI
This is the main content of the document written in **Markdown**.
You can include code blocks, links, and other formatting here.
Key Components:
- YAML Header: Metadata between `---` markers
- Content: Everything after the closing `---`
- Structured Data: Title, author, tags, etc.
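Splitting the header from the body takes only a few lines. Here is a minimal, dependency-free sketch; the naive `key: value` parsing is an assumption for illustration, and real code would typically use the python-frontmatter or PyYAML packages to handle nested YAML and lists like `tags`:

```python
def split_frontmatter(text: str) -> tuple[dict, str]:
    """Split a markdown document into (metadata, content).

    Naive 'key: value' parsing keeps this dependency-free; use
    python-frontmatter or PyYAML for anything beyond flat strings.
    """
    if not text.startswith('---'):
        return {}, text
    # split('---', 2): empty prefix, YAML header, remaining content
    _, header, content = text.split('---', 2)
    metadata = {}
    for line in header.strip().splitlines():
        key, sep, value = line.partition(':')
        if sep:
            metadata[key.strip()] = value.strip().strip('"')
    return metadata, content.strip()
```

Applied to the example above, this yields `{'title': 'Getting Started with AI', ...}` plus the markdown body as separate values.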
---PAGE---
Downloading Repository Data
We can download entire GitHub repositories as zip archives using the GitHub codeload API.
import io
import zipfile
import requests
def read_repo_data(repo_owner: str, repo_name: str) -> list[dict]:
    prefix = 'https://codeload.github.com'
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url)
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    # ... process files
    zf.close()
Why This Approach?
- No need to clone the repository
- Works with any public repository
- Processes everything in memory
- Fast and efficient
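The `# ... process files` step above could look like the following sketch: iterate the archive's entries, keep only markdown files, and decode their contents. The helper name `process_zip` and the in-memory demo archive are illustrative assumptions, not the course's actual code:

```python
import io
import zipfile

def process_zip(zf: zipfile.ZipFile) -> list[dict]:
    """Collect markdown files from a repository archive."""
    docs = []
    for info in zf.infolist():
        # Archive paths look like "repo-main/docs/index.md"
        if not info.filename.endswith(('.md', '.mdx')):
            continue  # skip code, images, etc.
        with zf.open(info) as f:
            content = f.read().decode('utf-8', errors='ignore')
        docs.append({'filename': info.filename, 'content': content})
    return docs

# Demo with a tiny in-memory archive (real ones come from codeload)
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as archive:
    archive.writestr('repo-main/README.md', '# Hello')
    archive.writestr('repo-main/script.py', 'print(1)')

docs = process_zip(zipfile.ZipFile(buf))
```

Because everything stays in a `BytesIO` buffer, nothing touches disk, which is what makes this approach fast for one-off ingestion.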
---PAGE---
Day 2: Chunking and Intelligent Processing
Now that we can ingest data, we need to break it into manageable pieces. AI models have token limits, and large documents need to be chunked intelligently.
Today we'll learn about:
- Simple chunking: Fixed-size chunks without overlap
- Sliding window: Overlapping chunks for context preservation
- Paragraph-based: Split by natural paragraph breaks
- Section-based: Split by markdown headers
- LLM-based: AI-powered semantic chunking using OpenRouter
Why Chunking Matters
The key insight: start simple, evaluate, then iterate. Most use cases don't require sophisticated chunking methods.
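As a concrete starting point, the sliding-window method is only a few lines. A minimal sketch, where a 500-character window with a 250-character step gives the 50% overlap used in the benchmark that follows:

```python
def sliding_window(text: str, size: int = 500, step: int = 250) -> list[str]:
    """Overlapping fixed-size chunks; overlap = size - step."""
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # last window reached the end
            break
    return chunks
```

Setting `step` equal to `size` removes the overlap entirely, so the same function also covers simple fixed-size chunking.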
---PAGE---
Real Benchmark Results
We tested all chunking methods on a sample markdown document (~1,900 characters). Here are the actual results:
Benchmark Comparison Table
| Method | Chunks | Avg Size | Min | Max | Has Topics |
|---|---|---|---|---|---|
| Simple | 4 | 472 chars | 387 | 500 | No |
| Sliding | 7 | 484 chars | 387 | 500 | No |
| Paragraph | 17 | 109 chars | 6 | 369 | No |
| Section | 3 | 538 chars | 239 | 956 | Header only |
| LLM | 8 | 234 chars | 124 | 369 | Yes |
Key Insights from Benchmark
Simple Chunking: Created 4 chunks with 3 full-size (500 chars) and 1 partial. Predictable but cuts mid-sentence.
Sliding Window: Created 7 chunks with 50% overlap. More chunks but better context preservation at boundaries.
Paragraph: Created 17 small chunks (avg 109 chars). Great for granular retrieval but many tiny chunks.
Section: Only 3 chunks following markdown headers. Best for well-structured documents with clear sections.
LLM (Gemini 2.0 Flash): Created 8 semantically meaningful chunks with topics like "Introduction", "FAQ Proposal Form - Steps", "Writing Good FAQ Guidelines". Each chunk is a complete thought.
---PAGE---
LLM Chunking Output Example
Chunk 1: Topic: "Introduction"
Content: # Contributing to DataTalksClub FAQ...
Chunk 2: Topic: "FAQ Proposal Form - Introduction"
Content: ## FAQ Proposal Form - We have an automated system...
Chunk 3: Topic: "FAQ Proposal Form - Steps"
Content: 1. Go to the FAQ Proposal form 2. Fill out the form...
Chunk 4: Topic: "FAQ Bot Actions"
Content: After that, our FAQ bot will: 1. Analyze your proposal...
Chunk 5: Topic: "FAQ Bot Outcomes"
Content: If it's NEW or UPDATE, the bot will create a PR...
Chunk 6: Topic: "Writing Good FAQ Guidelines - Question"
Content: ## Writing Good FAQ Guidelines - Question: Be specific...
Chunk 7: Topic: "Writing Good FAQ Guidelines - Answer"
Content: Answer - Start with a direct answer, Include code examples...
Chunk 8: Topic: "Questions or Issues"
Content: ## Questions or Issues? If you have questions...
---PAGE---
Running Day 2
Build and Run
docker-compose build day2
docker run --rm ai_agent_crashcoourse-day2
Demo All Rule-Based Methods
docker run --rm ai_agent_crashcoourse-day2 python main.py --demo
LLM-Based Chunking (OpenRouter)
# Demo LLM chunking
docker run --rm -e OPENROUTER_API_KEY=your_key \
ai_agent_crashcoourse-day2 python main.py --demo-llm
Default model: `google/gemini-2.0-flash-001` (fast and cheap)
---PAGE---
Day 2 Key Takeaways
Chunking Methods Comparison
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Simple | Uniform processing | Predictable sizes | May cut mid-sentence |
| Sliding | General purpose | Context preservation | More chunks (overlap) |
| Paragraph | Natural text | Semantic units | Variable sizes |
| Section | Structured docs | Logical grouping | Needs markdown headers |
| LLM | High-quality RAG | Semantic understanding, topics | API cost, slower |
When to Use Each Method
- Start with Sliding Window: Good default for most use cases
- Use Section for well-structured markdown documentation
- Use Paragraph when you need granular retrieval
- Use LLM when chunk quality matters more than speed/cost
---PAGE---
Day 3: Search - Text, Vector, and Hybrid
Now that we have ingested and chunked our data, we need to make it searchable. Today we'll learn about three search approaches:
- Text Search (Lexical): Fast keyword matching
- Vector Search (Semantic): Find similar content using embeddings
- Hybrid Search: Combine both for best results
Key Insight
Always start with the simplest approach. For search, that's text search. It's fast, efficient, works well for exact matches, and requires no model inference. Add complexity (vector search) only when basic approaches prove insufficient.
---PAGE---
Text Search with Minsearch
Text search identifies documents containing query words. We use the minsearch library for fast lexical matching.
from minsearch import Index
# Create index with searchable fields
index = Index(
    text_fields=["chunk", "title", "description", "filename"],
    keyword_fields=[]
)
# Fit the index with our chunks
index.fit(chunks)
# Search!
results = index.search("how to detect data drift", num_results=5)
When to Use Text Search:
- Exact keyword matching
- Specific technical terms
- Fast queries without ML overhead
- When you know the exact terminology
---PAGE---
Vector Search with Embeddings
Vector search uses embeddings to find semantically similar content, even when different words are used.
from sentence_transformers import SentenceTransformer
import numpy as np
# Load embedding model
model = SentenceTransformer('multi-qa-distilbert-cos-v1')
# Create embeddings for all chunks
embeddings = []
for chunk in chunks:
    # Normalize so that a plain dot product equals cosine similarity
    v = model.encode(chunk['text'], normalize_embeddings=True)
    embeddings.append(v)
embeddings = np.array(embeddings)
Searching with Vectors
def vector_search(query, model, embeddings, chunks, num_results=5):
    # Encode the query, normalized like the chunk embeddings
    query_embedding = model.encode(query, normalize_embeddings=True)
    # Dot product of unit vectors = cosine similarity
    similarities = np.dot(embeddings, query_embedding)
    # Get top results
    top_indices = np.argsort(similarities)[::-1][:num_results]
    return [chunks[i] for i in top_indices]
When to Use Vector Search:
- Semantic similarity (find related concepts)
- When users don't know exact terminology
- Natural language queries
- Finding conceptually similar content
---PAGE---
Hybrid Search
Hybrid search combines both approaches by running text and vector searches, then deduplicating results.
def hybrid_search(query, text_index, model, embeddings, chunks):
    # Get results from both methods
    text_results = text_search(text_index, query)
    vector_results = vector_search(query, model, embeddings, chunks)

    # Deduplicate using a stable per-chunk key
    seen_ids = set()
    combined = []
    for result in text_results + vector_results:
        chunk_id = result['filename'] + result['chunk'][:50]
        if chunk_id not in seen_ids:
            seen_ids.add(chunk_id)
            combined.append(result)
    return combined
Benefits of Hybrid Search:
- Best of both worlds
- Catches exact matches AND semantic matches
- More robust retrieval
---PAGE---
Running Day 3
Build and Run
docker-compose build day3
docker run --rm ai_agent_crashcoourse-day3
Text Search Demo (Lightweight)
docker run --rm ai_agent_crashcoourse-day3 python main.py --demo-text
Full Demo (Downloads ~100MB embedding model)
docker run --rm ai_agent_crashcoourse-day3 python main.py --demo
---PAGE---
Day 3 Key Takeaways
Search Methods Comparison
| Method | Speed | Accuracy | Use Case |
|---|---|---|---|
| Text | Fast | Exact matches | Keywords, technical terms |
| Vector | Slower | Semantic | Natural language, concepts |
| Hybrid | Medium | Comprehensive | Production RAG systems |
Practical Recommendations
- Start with Text Search: It's fast, simple, and often sufficient
- Add Vector Search when users need semantic matching
- Use Hybrid for production RAG applications
- Model choice matters: `multi-qa-distilbert-cos-v1` is a good balance of speed and quality
Next Steps
In Day 4, we'll learn how to:
- Connect to an LLM for answer generation
- Build a complete RAG pipeline
- Handle context and prompts effectively
---PAGE---
Day 4: Agents and Tools
An agent is fundamentally an LLM with the ability to invoke tools—external functions that enable information retrieval, calculations, or actions. Tools are what distinguish agents from basic chatbots.
Course Progress
- Day 1: Data download from GitHub
- Day 2: Data processing via chunking
- Day 3: Data indexing for searchability
- Day 4: Agent creation with tool access
Key Insight: Data preparation consumes most development time and represents the most critical component.
---PAGE---
Function Calling with OpenRouter
We use OpenRouter to access various LLMs. Developers must describe functions in a structured JSON format so LLMs understand how to invoke them:
import requests
import json
# Define the tool
text_search_tool = {
    "type": "function",
    "function": {
        "name": "text_search",
        "description": "Search the FAQ database for relevant information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query text"
                }
            },
            "required": ["query"]
        }
    }
}
# OpenRouter API call with tools
def call_agent(user_message, tools):
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENROUTER_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": "google/gemini-2.0-flash-001",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message}
            ],
            "tools": tools
        }
    )
    return response.json()
The agent analyzes user input, determines necessary tool calls, executes them, and generates context-aware responses.
---PAGE---
System Prompts Matter
Instructions significantly influence agent behavior. Enhanced prompts encourage strategic tool usage:
system_prompt = """You are a helpful FAQ assistant.
IMPORTANT RULES:
1. Always search for relevant information before answering
2. Make multiple searches if needed for comprehensive answers
3. If the search returns no results, say you don't have that information
4. Base your answers ONLY on search results, not training data
5. Be concise but thorough
When users ask questions, use the text_search tool to find relevant FAQ entries."""
Why System Prompts Are Critical
| Without Good Prompt | With Good Prompt |
|---|---|
| Agent answers from training data | Agent searches first, then answers |
| May hallucinate answers | Grounded in your actual data |
| Inconsistent tool usage | Reliably uses tools |
---PAGE---
Handling Tool Calls
When the LLM decides to use a tool, you need to execute it and send results back:
def run_agent_loop(user_question):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question}
    ]
    while True:
        response = call_openrouter(messages, tools=[text_search_tool])
        message = response["choices"][0]["message"]

        # Check if agent wants to call a tool
        if message.get("tool_calls"):
            # Append the assistant message once, then one result per tool call
            messages.append(message)
            for tool_call in message["tool_calls"]:
                func_name = tool_call["function"]["name"]
                func_args = json.loads(tool_call["function"]["arguments"])

                # Execute the tool
                if func_name == "text_search":
                    result = text_search(func_args["query"])

                # Add tool result to messages
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call["id"],
                    "content": json.dumps(result)
                })
        else:
            # No more tool calls, return final answer
            return message["content"]
---PAGE---
Complete Agent Example
Here's a full working agent that searches your FAQ database:
import os
import json
import requests
from minsearch import Index
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
# Your search index (from Day 3)
index = Index(text_fields=["chunk", "title"], keyword_fields=[])
index.fit(chunks)
def text_search(query: str) -> list:
    """Search the FAQ database."""
    results = index.search(query, num_results=3)
    return [{"text": r["chunk"], "title": r["title"]} for r in results]

def run_faq_agent(question: str) -> str:
    """Run the FAQ agent with tool support."""
    tools = [{
        "type": "function",
        "function": {
            "name": "text_search",
            "description": "Search FAQ database",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    }]
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question}
    ]

    # Agent loop
    for _ in range(5):  # Max 5 iterations
        response = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
            json={"model": "google/gemini-2.0-flash-001",
                  "messages": messages, "tools": tools}
        ).json()
        msg = response["choices"][0]["message"]

        if not msg.get("tool_calls"):
            return msg["content"]

        messages.append(msg)
        for tc in msg["tool_calls"]:
            args = json.loads(tc["function"]["arguments"])
            result = text_search(args["query"])
            messages.append({
                "role": "tool",
                "tool_call_id": tc["id"],
                "content": json.dumps(result)
            })

    return "Max iterations reached"
---PAGE---
Running Day 4
Build and Run
docker-compose build day4
docker run --rm -e OPENROUTER_API_KEY=your_key ai_agent_crashcoourse-day4
Interactive Mode
docker run --rm -it -e OPENROUTER_API_KEY=your_key \
ai_agent_crashcoourse-day4 python main.py --interactive
---PAGE---
Day 4 Key Takeaways
What Makes an Agent
| Chatbot | Agent |
|---|---|
| Responds from training data | Uses tools to access external data |
| Static knowledge | Dynamic knowledge retrieval |
| No actions | Can perform actions |
| Simple Q&A | Complex multi-step tasks |
Key Points
- Tools define "agentic" capabilities - the distinction between responding from training data vs accessing domain-specific knowledge dynamically
- System prompts are critical - they guide when and how the agent uses tools
- The agent loop - call LLM → check for tool calls → execute tools → repeat until done
- OpenRouter provides access to many models with a unified API
---PAGE---
Day 5: Evaluation
Is our agent actually good? Today we learn how to answer this question systematically.
Why Evaluation Matters
Evaluation is critical for building reliable AI systems. Without proper evaluation:
- You can't tell if changes improve or hurt performance
- You can't compare different approaches
- You can't build confidence before deploying to users
Key Concepts
- Vibe Check: Manual testing - interact with the agent and see if responses make sense
- Logging: Record all interactions for later analysis
- LLM as a Judge: Use AI to evaluate AI responses automatically
- Test Data Generation: Create test questions using AI
---PAGE---
Logging Agent Interactions
The first step is capturing data. We log every interaction:
import json
import secrets
from pathlib import Path
from datetime import datetime
LOG_DIR = Path('logs')
LOG_DIR.mkdir(exist_ok=True)
def save_log(agent, messages, source="user"):
    """Save interaction to JSON file."""
    entry = {
        "agent_name": agent.name,
        "system_prompt": agent.system_prompt,
        "model": agent.model,
        "messages": messages,
        "source": source,
        "timestamp": datetime.now().isoformat()
    }
    ts_str = datetime.now().strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)
    filename = f"{agent.name}_{ts_str}_{rand_hex}.json"
    with open(LOG_DIR / filename, "w") as f:
        json.dump(entry, f, indent=2)
What to log:
- System prompt used
- Model name
- User query
- Tool calls and results
- Final response
- Source (user vs AI-generated)
---PAGE---
LLM as a Judge
Instead of manually evaluating hundreds of responses, we use AI to evaluate AI:
EVALUATION_PROMPT = """
Evaluate the AI agent's answer using this checklist:
- instructions_follow: Agent followed instructions
- answer_relevant: Response addresses the question
- answer_clear: Answer is clear and understandable
- answer_citations: Proper citations included
- completeness: Response covers key aspects
- tool_usage: Search tool used appropriately
For each check, output true/false with a brief explanation.
"""
def evaluate_interaction(question, answer, log):
    """Use LLM to evaluate agent response."""
    prompt = f"""
    <QUESTION>{question}</QUESTION>
    <ANSWER>{answer}</ANSWER>
    <LOG>{json.dumps(log)}</LOG>
    """
    # Call evaluation LLM
    result = call_llm(EVALUATION_PROMPT, prompt)
    return result
Pro tip: Use a different model for evaluation than the one being evaluated. This reduces self-bias.
---PAGE---
Test Data Generation
We can use AI to generate realistic test questions:
QUESTION_GEN_PROMPT = """
Based on the FAQ content, generate realistic questions students might ask.
Questions should:
- Be natural and varied in style
- Range from simple to complex
- Include technical and general questions
Generate {num_questions} questions.
"""
import random

def generate_test_questions(chunks, num_questions=10):
    """Generate test questions from FAQ content."""
    sample = random.sample(chunks, min(10, len(chunks)))
    content = json.dumps([c['chunk'] for c in sample])
    # Fill the {num_questions} placeholder before sending the prompt
    prompt = QUESTION_GEN_PROMPT.format(num_questions=num_questions)
    result = call_llm(prompt, content)
    return result['questions']
Benefits of AI-generated test data:
- Faster than manual creation
- Can cover edge cases
- Scales easily
Limitations:
- May not reflect real user behavior
- Could miss important edge cases only real users find
---PAGE---
Calculating Metrics
After evaluation, calculate pass rates for each check:
import pandas as pd
def calculate_metrics(eval_results):
    """Calculate pass rates from evaluation results."""
    rows = []
    for result in eval_results:
        row = {"question": result["question"]}
        for check in result["checks"]:
            row[check["name"]] = check["pass"]
        rows.append(row)
    df = pd.DataFrame(rows)
    # The mean of each boolean column is that check's pass rate
    return df.drop(columns=["question"]).mean()

# Example output:
# instructions_follow    0.30  (30% pass rate)
# answer_relevant        1.00  (100% pass rate)
# answer_citations       0.30  (30% pass rate)
# tool_usage             1.00  (100% pass rate)
Key insight: The most important check is answer_relevant. If this score is low, your agent isn't ready.
---PAGE---
Running Day 5
Build and Run
docker-compose build day5
docker run --rm -e OPENROUTER_API_KEY=your_key ai_agent_crashcoourse-day5
Demo Individual Components
# Logging only
docker run --rm -e OPENROUTER_API_KEY=your_key \
ai_agent_crashcoourse-day5 python main.py --demo-logging

# Evaluate existing logs
docker run --rm -e OPENROUTER_API_KEY=your_key \
ai_agent_crashcoourse-day5 python main.py --demo-eval

# Generate test questions
docker run --rm -e OPENROUTER_API_KEY=your_key \
ai_agent_crashcoourse-day5 python main.py --demo-generate
---PAGE---
Day 5 Key Takeaways
Evaluation Process
| Step | What | Why |
|---|---|---|
| 1. Log | Record all interactions | Build evaluation dataset |
| 2. Generate | Create test questions with AI | Scale test coverage |
| 3. Evaluate | LLM as a Judge | Automate quality checks |
| 4. Measure | Calculate metrics | Track improvements |
What You Can Do With Evaluation
- Decide if quality is good enough for deployment
- Compare different approaches (chunking, search, prompts)
- Track improvements over time
- Identify edge cases that need attention
Course Summary
You've now built a complete AI agent system:
- ✅ Ingest data from GitHub (Day 1)
- ✅ Chunk documents intelligently (Day 2)
- ✅ Search with text, vector, and hybrid methods (Day 3)
- ✅ Answer questions using tools and LLMs (Day 4)
- ✅ Evaluate and measure agent quality (Day 5)
Congratulations!