Production notes from a builder who ships, operates, and occasionally breaks his own AI systems.
Latest
Most ML teams reach for MLflow before they need it and pay the operational tax for years. The custom model-registry pattern from a real labeling platform: version, compare, and roll back models with a 200-line FastAPI service. When the DIY version is the right answer, and the three signals that say it isn't anymore.
Most LLM features ship on vibes. The first time you regret it is the day a prompt change quietly breaks half your traffic and nobody notices for a week. Here is what an honest eval harness actually contains (golden datasets, LLM-as-judge, prompt versioning) and the real cost of running it.
Three ways to hand data to an LLM agent: the Model Context Protocol, a boring REST API with an API key, or a curated Markdown file. Each is right some of the time and wrong a lot of the time. Here's the honest decision tree.
Every MCP server you connect loads its tool schemas into the context window before the first user turn. Here's the arithmetic on how expensive that gets, why most teams never measure it, and how to stop paying for tools the agent will never call.
Gemini's 2M-token and Claude's 1M-token context windows made 'just paste it all' a real engineering option. Here's the cost math, the latency curve, the quiet failure mode of context dilution, and the rule for when stuffing beats RAG, and when it silently hurts.
You tuned the embedding model. You went hybrid. Your RAG still misses. The bug is upstream — in how you split documents. Five chunking strategies, when each wins, and how to actually evaluate them.
Every LLM-powered feature breaks the same way in production: the model returns almost-JSON. Markdown fences, trailing commas, a chatty preamble, a missing closing brace. Here's the 3-layer fix that ships — native structured outputs, Pydantic validation, and json_repair + retry loops.
Every RAG demo shows embeddings and stops there. Real production search almost always mixes keyword and semantic retrieval. Here's what's happening under the hood, why hybrid wins, and a runnable Postgres example in ~40 lines.
Graph databases look like the obvious answer for AI memory: entities, relationships, multi-hop queries. So why did OpenClaw, MemOS, and every shipping system pick flat Markdown instead? A contrarian deep dive into the real tradeoffs.