AI Agents Crash Course
Learn how to build AI agents from scratch, starting with data ingestion and indexing from GitHub repositories
Introduction to AI Agents
Welcome to the AI Agents Crash Course! In this series, you'll learn how to build intelligent AI agents step by step.
What You'll Learn
This course covers the complete journey of building AI agents:
- Day 1: Ingest and index data from GitHub repositories
- Day 2: Chunking and intelligent processing
- Day 3: Search - Text, Vector, and Hybrid
- Day 4: Agents and Tools - Building the conversational agent
- Day 5: Evaluation - Testing and measuring agent quality
Prerequisites
- Basic Python knowledge
- Docker installed on your machine
- Familiarity with command line
Credits
This course is inspired by and based on the AI Hero program by Alexey Grigorev.
Code Repository
Follow along with the complete code at: GitHub Repository
Let's get started!
---PAGE---
Day 1: Ingest and Index Your Data
The first step in building any AI agent is gathering and processing data. Today we'll learn how to:
- Download repositories as zip archives
- Parse frontmatter metadata
- Extract content from markdown files
What is Frontmatter?
Frontmatter is a format used in markdown files where YAML metadata is placed at the top of the file between `---` markers. This is very useful because we can extract structured information (like title, tags, difficulty level) along with the content.
---
title: "Getting Started with AI"
author: "John Doe"
date: "2024-01-15"
tags: ["ai", "machine-learning", "tutorial"]
difficulty: "beginner"
---
# Getting Started with AI
This is the main content of the document written in **Markdown**.
You can include code blocks, links, and other formatting here.
Key Components:
- YAML Header: Metadata between `---` markers
- Content: Everything after the closing `---`
- Structured Data: Title, author, tags, etc.
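Splitting the header from the body takes only a few lines. Here is a minimal, dependency-free sketch; the naive `key: value` parsing is an assumption for illustration, and real code would typically use the python-frontmatter or PyYAML packages to handle nested YAML and lists like `tags`:

```python
def split_frontmatter(text: str) -> tuple[dict, str]:
    """Split a markdown document into (metadata, content).

    Naive 'key: value' parsing keeps this dependency-free; use
    python-frontmatter or PyYAML for anything beyond flat strings.
    """
    if not text.startswith('---'):
        return {}, text
    # split('---', 2): empty prefix, YAML header, remaining content
    _, header, content = text.split('---', 2)
    metadata = {}
    for line in header.strip().splitlines():
        key, sep, value = line.partition(':')
        if sep:
            metadata[key.strip()] = value.strip().strip('"')
    return metadata, content.strip()
```

Applied to the example above, this yields `{'title': 'Getting Started with AI', ...}` plus the markdown body as separate values.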
---PAGE---
Downloading Repository Data
We can download entire GitHub repositories as zip archives using the GitHub codeload API.
import io
import zipfile
import requests
def read_repo_data(repo_owner: str, repo_name: str) -> list[dict]:
    prefix = 'https://codeload.github.com'
    url = f'{prefix}/{repo_owner}/{repo_name}/zip/refs/heads/main'
    resp = requests.get(url)
    if resp.status_code != 200:
        raise Exception(f"Failed to download repository: {resp.status_code}")
    zf = zipfile.ZipFile(io.BytesIO(resp.content))
    # ... process files
    zf.close()
Why This Approach?
- No need to clone the repository
- Works with any public repository
- Processes everything in memory
- Fast and efficient
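The `# ... process files` step above could look like the following sketch: iterate the archive's entries, keep only markdown files, and decode their contents. The helper name `process_zip` and the in-memory demo archive are illustrative assumptions, not the course's actual code:

```python
import io
import zipfile

def process_zip(zf: zipfile.ZipFile) -> list[dict]:
    """Collect markdown files from a repository archive."""
    docs = []
    for info in zf.infolist():
        # Archive paths look like "repo-main/docs/index.md"
        if not info.filename.endswith(('.md', '.mdx')):
            continue  # skip code, images, etc.
        with zf.open(info) as f:
            content = f.read().decode('utf-8', errors='ignore')
        docs.append({'filename': info.filename, 'content': content})
    return docs

# Demo with a tiny in-memory archive (real ones come from codeload)
buf = io.BytesIO()
with zipfile.ZipFile(buf, 'w') as archive:
    archive.writestr('repo-main/README.md', '# Hello')
    archive.writestr('repo-main/script.py', 'print(1)')

docs = process_zip(zipfile.ZipFile(buf))
```

Because everything stays in a `BytesIO` buffer, nothing touches disk, which is what makes this approach fast for one-off ingestion.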
---PAGE---
Day 2: Chunking and Intelligent Processing
Now that we can ingest data, we need to break it into manageable pieces. AI models have token limits, and large documents need to be chunked intelligently.
Today we'll learn about:
- Simple chunking: Fixed-size chunks without overlap
- Sliding window: Overlapping chunks for context preservation
- Paragraph-based: Split by natural paragraph breaks
- Section-based: Split by markdown headers
- LLM-based: AI-powered semantic chunking using OpenRouter
Why Chunking Matters
The key insight: start simple, evaluate, then iterate. Most use cases don't require sophisticated chunking methods.
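As a concrete starting point, the sliding-window method is only a few lines. A minimal sketch, where a 500-character window with a 250-character step gives the 50% overlap used in the benchmark that follows:

```python
def sliding_window(text: str, size: int = 500, step: int = 250) -> list[str]:
    """Overlapping fixed-size chunks; overlap = size - step."""
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):  # last window reached the end
            break
    return chunks
```

Setting `step` equal to `size` removes the overlap entirely, so the same function also covers simple fixed-size chunking.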
---PAGE---
Real Benchmark Results
We tested all chunking methods on a sample markdown document (~1,900 characters). Here are the actual results:
Benchmark Comparison Table
| Method | Chunks | Avg Size | Min | Max | Has Topics |
|---|---|---|---|---|---|
| Simple | 4 | 472 chars | 387 | 500 | No |
| Sliding | 7 | 484 chars | 387 | 500 | No |
| Paragraph | 17 | 109 chars | 6 | 369 | No |
| Section | 3 | 538 chars | 239 | 956 | Header only |
| LLM | 8 | 234 chars | 124 | 369 | Yes |
Key Insights from Benchmark
Simple Chunking: Created 4 chunks with 3 full-size (500 chars) and 1 partial. Predictable but cuts mid-sentence.
Sliding Window: Created 7 chunks with 50% overlap. More chunks but better context preservation at boundaries.
Paragraph: Created 17 small chunks (avg 109 chars). Great for granular retrieval but many tiny chunks.
Section: Only 3 chunks following markdown headers. Best for well-structured documents with clear sections.
LLM (Gemini 2.0 Flash): Created 8 semantically meaningful chunks with topics like "Introduction", "FAQ Proposal Form - Steps", "Writing Good FAQ Guidelines". Each chunk is a complete thought.
---PAGE---
LLM Chunking Output Example
Chunk 1: Topic: "Introduction"
Content: # Contributing to DataTalksClub FAQ...
Chunk 2: Topic: "FAQ Proposal Form - Introduction"
Content: ## FAQ Proposal Form - We have an automated system...
Chunk 3: Topic: "FAQ Proposal Form - Steps"
Content: 1. Go to the FAQ Proposal form 2. Fill out the form...
Chunk 4: Topic: "FAQ Bot Actions"
Content: After that, our FAQ bot will: 1. Analyze your proposal...
Chunk 5: Topic: "FAQ Bot Outcomes"
Content: If it's NEW or UPDATE, the bot will create a PR...
Chunk 6: Topic: "Writing Good FAQ Guidelines - Question"
Content: ## Writing Good FAQ Guidelines - Question: Be specific...
Chunk 7: Topic: "Writing Good FAQ Guidelines - Answer"
Content: Answer - Start with a direct answer, Include code examples...
Chunk 8: Topic: "Questions or Issues"
Content: ## Questions or Issues? If you have questions...
---PAGE---
Running Day 2
Build and Run
docker-compose build day2
docker run --rm ai_agent_crashcoourse-day2
Demo All Rule-Based Methods
docker run --rm ai_agent_crashcoourse-day2 python main.py --demo
LLM-Based Chunking (OpenRouter)
# Demo LLM chunking
docker run --rm -e OPENROUTER_API_KEY=your_key \
ai_agent_crashcoourse-day2 python main.py --demo-llm
Default model: `google/gemini-2.0-flash-001` (fast and cheap)
---PAGE---
Day 2 Key Takeaways
Chunking Methods Comparison
| Method | Best For | Pros | Cons |
|---|---|---|---|
| Simple | Uniform processing | Predictable sizes | May cut mid-sentence |
| Sliding | General purpose | Context preservation | More chunks (overlap) |
| Paragraph | Natural text | Semantic units | Variable sizes |
| Section | Structured docs | Logical grouping | Needs markdown headers |
| LLM | High-quality RAG | Semantic understanding, topics | API cost, slower |
When to Use Each Method
- Start with Sliding Window: Good default for most use cases
- Use Section for well-structured markdown documentation
- Use Paragraph when you need granular retrieval
- Use LLM when chunk quality matters more than speed/cost
---PAGE---
Day 3: Search - Text, Vector, and Hybrid
Now that we have ingested and chunked our data, we need to make it searchable. Today we'll learn about three search approaches:
- Text Search (Lexical): Fast keyword matching
- Vector Search (Semantic): Find similar content using embeddings
- Hybrid Search: Combine both for best results
Key Insight
Always start with the simplest approach. For search, that's text search. It's fast, efficient, works well for exact matches, and requires no model inference. Add complexity (vector search) only when basic approaches prove insufficient.
---PAGE---
Text Search with Minsearch
Text search identifies documents containing query words. We use the minsearch library for fast lexical matching.
from minsearch import Index
# Create index with searchable fields
index = Index(
    text_fields=["chunk", "title", "description", "filename"],
    keyword_fields=[]
)
# Fit the index with our chunks
index.fit(chunks)
# Search!
results = index.search("how to detect data drift", num_results=5)
When to Use Text Search:
- Exact keyword matching
- Specific technical terms
- Fast queries without ML overhead
- When you know the exact terminology
---PAGE---
Vector Search with Embeddings
Vector search uses embeddings to find semantically similar content, even when different words are used.
from sentence_transformers import SentenceTransformer
import numpy as np
# Load embedding model
model = SentenceTransformer('multi-qa-distilbert-cos-v1')
# Create embeddings for all chunks
embeddings = []
for chunk in chunks:
    # Normalize so that a plain dot product equals cosine similarity
    v = model.encode(chunk['text'], normalize_embeddings=True)
    embeddings.append(v)
embeddings = np.array(embeddings)
Searching with Vectors
def vector_search(query, model, embeddings, chunks, num_results=5):
    # Encode the query, normalized like the chunk embeddings
    query_embedding = model.encode(query, normalize_embeddings=True)
    # Dot product of unit vectors = cosine similarity
    similarities = np.dot(embeddings, query_embedding)
    # Get top results
    top_indices = np.argsort(similarities)[::-1][:num_results]
    return [chunks[i] for i in top_indices]
When to Use Vector Search:
- Semantic similarity (find related concepts)
- When users don't know exact terminology
- Natural language queries
- Finding conceptually similar content
---PAGE---
Hybrid Search
Hybrid search combines both approaches by running text and vector searches, then deduplicating results.
def hybrid_search(query, text_index, model, embeddings, chunks):
    # Get results from both methods
    text_results = text_search(text_index, query)
    vector_results = vector_search(query, model, embeddings, chunks)

    # Deduplicate using a stable per-chunk key
    seen_ids = set()
    combined = []
    for result in text_results + vector_results:
        chunk_id = result['filename'] + result['chunk'][:50]
        if chunk_id not in seen_ids:
            seen_ids.add(chunk_id)
            combined.append(result)
    return combined
Benefits of Hybrid Search:
- Best of both worlds
- Catches exact matches AND semantic matches
- More robust retrieval
---PAGE---
Running Day 3
Build and Run
docker-compose build day3
docker run --rm ai_agent_crashcoourse-day3
Text Search Demo (Lightweight)
docker run --rm ai_agent_crashcoourse-day3 python main.py --demo-text
Full Demo (Downloads ~100MB embedding model)
docker run --rm ai_agent_crashcoourse-day3 python main.py --demo
---PAGE---
Day 3 Key Takeaways
Search Methods Comparison
| Method | Speed | Accuracy | Use Case |
|---|---|---|---|
| Text | Fast | Exact matches | Keywords, technical terms |
| Vector | Slower | Semantic | Natural language, concepts |
| Hybrid | Medium | Comprehensive | Production RAG systems |
Practical Recommendations
- Start with Text Search: It's fast, simple, and often sufficient
- Add Vector Search when users need semantic matching
- Use Hybrid for production RAG applications
- Model choice matters: `multi-qa-distilbert-cos-v1` is a good balance of speed and quality
Next Steps
In Day 4, we'll learn how to:
- Connect to an LLM for answer generation
- Build a complete RAG pipeline
- Handle context and prompts effectively
---PAGE---
Day 4: Agents and Tools
An agent is fundamentally an LLM with the ability to invoke tools—external functions that enable information retrieval, calculations, or actions. Tools are what distinguish agents from basic chatbots.
Course Progress
- Day 1: Data download from GitHub
- Day 2: Data processing via chunking
- Day 3: Data indexing for searchability
- Day 4: Agent creation with tool access
Key Insight: Data preparation consumes most development time and represents the most critical component.
---PAGE---
Function Calling with OpenRouter
We use OpenRouter to access various LLMs. Developers must describe functions in a structured JSON format so LLMs understand how to invoke them:
import requests
import json
# Define the tool
text_search_tool = {
    "type": "function",
    "function": {
        "name": "text_search",
        "description": "Search the FAQ database for relevant information",
        "parameters": {
            "type": "object",
            "properties": {
                "query": {
                    "type": "string",
                    "description": "Search query text"
                }
            },
            "required": ["query"]
        }
    }
}
# OpenRouter API call with tools
def call_agent(user_message, tools):
    response = requests.post(
        "https://openrouter.ai/api/v1/chat/completions",
        headers={
            "Authorization": f"Bearer {OPENROUTER_API_KEY}",
            "Content-Type": "application/json"
        },
        json={
            "model": "google/gemini-2.0-flash-001",
            "messages": [
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_message}
            ],
            "tools": tools
        }
    )
    return response.json()
The agent analyzes user input, determines necessary tool calls, executes them, and generates context-aware responses.
---PAGE---
System Prompts Matter
Instructions significantly influence agent behavior. Enhanced prompts encourage strategic tool usage:
system_prompt = """You are a helpful FAQ assistant.
IMPORTANT RULES:
1. Always search for relevant information before answering
2. Make multiple searches if needed for comprehensive answers
3. If the search returns no results, say you don't have that information
4. Base your answers ONLY on search results, not training data
5. Be concise but thorough
When users ask questions, use the text_search tool to find relevant FAQ entries."""
Why System Prompts Are Critical
| Without Good Prompt | With Good Prompt |
|---|---|
| Agent answers from training data | Agent searches first, then answers |
| May hallucinate answers | Grounded in your actual data |
| Inconsistent tool usage | Reliably uses tools |
---PAGE---
Handling Tool Calls
When the LLM decides to use a tool, you need to execute it and send results back:
def run_agent_loop(user_question):
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_question}
    ]
    while True:
        response = call_openrouter(messages, tools=[text_search_tool])
        message = response["choices"][0]["message"]

        # Check if agent wants to call a tool
        if message.get("tool_calls"):
            # Append the assistant message once, then one result per tool call
            messages.append(message)
            for tool_call in message["tool_calls"]:
                func_name = tool_call["function"]["name"]
                func_args = json.loads(tool_call["function"]["arguments"])

                # Execute the tool
                if func_name == "text_search":
                    result = text_search(func_args["query"])

                # Add tool result to messages
                messages.append({
                    "role": "tool",
                    "tool_call_id": tool_call["id"],
                    "content": json.dumps(result)
                })
        else:
            # No more tool calls, return final answer
            return message["content"]
---PAGE---
Complete Agent Example
Here's a full working agent that searches your FAQ database:
import os
import json
import requests
from minsearch import Index
OPENROUTER_API_KEY = os.environ.get("OPENROUTER_API_KEY")
# Your search index (from Day 3)
index = Index(text_fields=["chunk", "title"], keyword_fields=[])
index.fit(chunks)
def text_search(query: str) -> list:
    """Search the FAQ database."""
    results = index.search(query, num_results=3)
    return [{"text": r["chunk"], "title": r["title"]} for r in results]

def run_faq_agent(question: str) -> str:
    """Run the FAQ agent with tool support."""
    tools = [{
        "type": "function",
        "function": {
            "name": "text_search",
            "description": "Search FAQ database",
            "parameters": {
                "type": "object",
                "properties": {
                    "query": {"type": "string", "description": "Search query"}
                },
                "required": ["query"]
            }
        }
    }]
    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": question}
    ]

    # Agent loop
    for _ in range(5):  # Max 5 iterations
        response = requests.post(
            "https://openrouter.ai/api/v1/chat/completions",
            headers={"Authorization": f"Bearer {OPENROUTER_API_KEY}"},
            json={"model": "google/gemini-2.0-flash-001",
                  "messages": messages, "tools": tools}
        ).json()
        msg = response["choices"][0]["message"]

        if not msg.get("tool_calls"):
            return msg["content"]

        messages.append(msg)
        for tc in msg["tool_calls"]:
            args = json.loads(tc["function"]["arguments"])
            result = text_search(args["query"])
            messages.append({
                "role": "tool",
                "tool_call_id": tc["id"],
                "content": json.dumps(result)
            })

    return "Max iterations reached"
---PAGE---
Running Day 4
Build and Run
docker-compose build day4
docker run --rm -e OPENROUTER_API_KEY=your_key ai_agent_crashcoourse-day4
Interactive Mode
docker run --rm -it -e OPENROUTER_API_KEY=your_key \
ai_agent_crashcoourse-day4 python main.py --interactive
---PAGE---
Day 4 Key Takeaways
What Makes an Agent
| Chatbot | Agent |
|---|---|
| Responds from training data | Uses tools to access external data |
| Static knowledge | Dynamic knowledge retrieval |
| No actions | Can perform actions |
| Simple Q&A | Complex multi-step tasks |
Key Points
- Tools define "agentic" capabilities - the distinction between responding from training data vs accessing domain-specific knowledge dynamically
- System prompts are critical - they guide when and how the agent uses tools
- The agent loop - call LLM → check for tool calls → execute tools → repeat until done
- OpenRouter provides access to many models with a unified API
---PAGE---
Day 5: Evaluation
Is our agent actually good? Today we learn how to answer this question systematically.
Why Evaluation Matters
Evaluation is critical for building reliable AI systems. Without proper evaluation:
- You can't tell if changes improve or hurt performance
- You can't compare different approaches
- You can't build confidence before deploying to users
Key Concepts
- Vibe Check: Manual testing - interact with the agent and see if responses make sense
- Logging: Record all interactions for later analysis
- LLM as a Judge: Use AI to evaluate AI responses automatically
- Test Data Generation: Create test questions using AI
---PAGE---
Logging Agent Interactions
The first step is capturing data. We log every interaction:
import json
import secrets
from pathlib import Path
from datetime import datetime
LOG_DIR = Path('logs')
LOG_DIR.mkdir(exist_ok=True)
def save_log(agent, messages, source="user"):
    """Save interaction to JSON file."""
    entry = {
        "agent_name": agent.name,
        "system_prompt": agent.system_prompt,
        "model": agent.model,
        "messages": messages,
        "source": source,
        "timestamp": datetime.now().isoformat()
    }
    ts_str = datetime.now().strftime("%Y%m%d_%H%M%S")
    rand_hex = secrets.token_hex(3)
    filename = f"{agent.name}_{ts_str}_{rand_hex}.json"
    with open(LOG_DIR / filename, "w") as f:
        json.dump(entry, f, indent=2)
What to log:
- System prompt used
- Model name
- User query
- Tool calls and results
- Final response
- Source (user vs AI-generated)
---PAGE---
LLM as a Judge
Instead of manually evaluating hundreds of responses, we use AI to evaluate AI:
EVALUATION_PROMPT = """
Evaluate the AI agent's answer using this checklist:
- instructions_follow: Agent followed instructions
- answer_relevant: Response addresses the question
- answer_clear: Answer is clear and understandable
- answer_citations: Proper citations included
- completeness: Response covers key aspects
- tool_usage: Search tool used appropriately
For each check, output true/false with a brief explanation.
"""
def evaluate_interaction(question, answer, log):
    """Use LLM to evaluate agent response."""
    prompt = f"""
    <QUESTION>{question}</QUESTION>
    <ANSWER>{answer}</ANSWER>
    <LOG>{json.dumps(log)}</LOG>
    """
    # Call evaluation LLM
    result = call_llm(EVALUATION_PROMPT, prompt)
    return result
Pro tip: Use a different model for evaluation than the one being evaluated. This reduces self-bias.
---PAGE---
Test Data Generation
We can use AI to generate realistic test questions:
QUESTION_GEN_PROMPT = """
Based on the FAQ content, generate realistic questions students might ask.
Questions should:
- Be natural and varied in style
- Range from simple to complex
- Include technical and general questions
Generate {num_questions} questions.
"""
import random

def generate_test_questions(chunks, num_questions=10):
    """Generate test questions from FAQ content."""
    sample = random.sample(chunks, min(10, len(chunks)))
    content = json.dumps([c['chunk'] for c in sample])
    # Fill the {num_questions} placeholder before sending the prompt
    prompt = QUESTION_GEN_PROMPT.format(num_questions=num_questions)
    result = call_llm(prompt, content)
    return result['questions']
Benefits of AI-generated test data:
- Faster than manual creation
- Can cover edge cases
- Scales easily
Limitations:
- May not reflect real user behavior
- Could miss important edge cases only real users find
---PAGE---
Calculating Metrics
After evaluation, calculate pass rates for each check:
import pandas as pd
def calculate_metrics(eval_results):
    """Calculate pass rates from evaluation results."""
    rows = []
    for result in eval_results:
        row = {"question": result["question"]}
        for check in result["checks"]:
            row[check["name"]] = check["pass"]
        rows.append(row)
    df = pd.DataFrame(rows)
    # The mean of each boolean column is that check's pass rate
    return df.drop(columns=["question"]).mean()

# Example output:
# instructions_follow    0.30  (30% pass rate)
# answer_relevant        1.00  (100% pass rate)
# answer_citations       0.30  (30% pass rate)
# tool_usage             1.00  (100% pass rate)
Key insight: The most important check is answer_relevant. If this score is low, your agent isn't ready.
---PAGE---
Running Day 5
Build and Run
docker-compose build day5
docker run --rm -e OPENROUTER_API_KEY=your_key ai_agent_crashcoourse-day5
Demo Individual Components
# Logging only
docker run --rm -e OPENROUTER_API_KEY=your_key \
ai_agent_crashcoourse-day5 python main.py --demo-logging

# Evaluate existing logs
docker run --rm -e OPENROUTER_API_KEY=your_key \
ai_agent_crashcoourse-day5 python main.py --demo-eval

# Generate test questions
docker run --rm -e OPENROUTER_API_KEY=your_key \
ai_agent_crashcoourse-day5 python main.py --demo-generate
---PAGE---
Day 5 Key Takeaways
Evaluation Process
| Step | What | Why |
|---|---|---|
| 1. Log | Record all interactions | Build evaluation dataset |
| 2. Generate | Create test questions with AI | Scale test coverage |
| 3. Evaluate | LLM as a Judge | Automate quality checks |
| 4. Measure | Calculate metrics | Track improvements |
What You Can Do With Evaluation
- Decide if quality is good enough for deployment
- Compare different approaches (chunking, search, prompts)
- Track improvements over time
- Identify edge cases that need attention
Course Summary
You've now built a complete AI agent system:
- ✅ Ingest data from GitHub (Day 1)
- ✅ Chunk documents intelligently (Day 2)
- ✅ Search with text, vector, and hybrid methods (Day 3)
- ✅ Answer questions using tools and LLMs (Day 4)
- ✅ Evaluate and measure agent quality (Day 5)
Congratulations!