Prompt Versioning Without Langfuse — Three Lean Paths from prompts.yaml to Postgres
Langfuse, PromptLayer, and Promptfoo will all sell you a prompt registry. There are three reasonable ways to ship prompt versioning in 2026, and the choice is about team shape, not technology. This post walks through all three — Langfuse Cloud, prompts.yaml + git, and a 2-table Postgres schema with an 80-line FastAPI router — on the same customer-support classifier, running on Claude Haiku 4.5.
Prompt versioning is one of those things you'll hear is essential the moment your LLM feature touches production, yet most teams shipping LLM features today are doing it with a string literal at the top of a Python file. Langfuse, PromptLayer, and Promptfoo will all sell you a prompt registry. So will your prompts.yaml. The question worth asking before you pick: what does a real prompt registry actually give you, and how much of that do you already have with files in git?
This post answers both. The same customer-support classifier, walked through three lean paths to a versioned prompt: Langfuse Cloud (the polished SaaS), prompts.yaml + git (the minimum), and Postgres + a small FastAPI router (the middle ground that matches Langfuse's shape on infrastructure you already run). All three call Claude Haiku 4.5 through the Anthropic SDK. The choice between them is about team shape, not technology — and seeing all three side by side is the cleanest way to know which one fits yours.
What is prompt versioning, in plain English
A prompt is a string. The moment that string is doing work in production — classifying support tickets, generating onboarding emails, scoring resumes, ranking search results — you need three things that a bare string in source code doesn't give you for free.
Versions. When someone changes the prompt, you want a labelled history: this is the prompt that ran for the first month, this is the one we shipped after the v2 rewrite, this is the one we rolled back to after the incident. Which version produced which output is the question you'll be asked in every postmortem.
Hot-swap. You want to be able to change the active prompt without a code deploy. Sometimes the prompt is the bug; the model is fine. A registry lets a non-engineer (product, support lead, prompt engineer) edit the live string and have the next request pick it up.
A/B labels. You want to run two prompts side by side on real traffic, measure the win-rate, and promote the better one. "Production" and "experiment" are the simplest version of this; full multi-armed bandit routing is the heavy version.
That's it. Three things. Tracing, eval, cost dashboards — those are different products that the same SaaS happens to sell alongside the registry. When this post talks about "prompt versioning" I mean those three jobs only.
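The "production" versus "experiment" split can be sketched as a deterministic bucket assignment. This is a minimal illustration rather than any registry's API: the helper name pick_label, the 10% default share, and the choice of a user id as the bucketing key are all assumptions made for the example.

```python
import hashlib


def pick_label(user_id: str, experiment_share: float = 0.10) -> str:
    """Return "experiment" for a stable fraction of users, else "production".

    Hypothetical helper: hashing the user id keeps each user on the same
    prompt across requests, so the comparison is not blurred by users
    flipping between the two prompts.
    """
    digest = hashlib.sha256(user_id.encode("utf-8")).digest()
    # Map the first 4 bytes of the hash to a float in [0, 1).
    bucket = int.from_bytes(digest[:4], "big") / 2**32
    return "experiment" if bucket < experiment_share else "production"


if __name__ == "__main__":
    labels = [pick_label(f"user-{i}") for i in range(10_000)]
    # Prints the observed experiment share, which should land near 0.10.
    print(labels.count("experiment") / len(labels))
```

The returned label is what you would pass to whichever fetch call your registry exposes; the routing itself stays in your app.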
Two things to keep in mind from the start.
It is not magic. Every registry boils down to "fetch a string by name, render variables into it, hand it to your LLM client." The whole problem fits on a napkin. The value of a real registry is the workflow it builds on top of that — a UI, an audit log, a label system, a non-engineer-friendly editor — not the technology.
It is also not always the right choice. For a team where the same three engineers write the prompts and ship the code, prompts.yaml and a pull request is already audit, version, and review in one tool. The registry is overhead. Knowing the difference is the point of learning it.
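The napkin version of "fetch a string by name, render variables into it, hand it to your LLM client" looks roughly like this. It is a sketch under stated assumptions: the in-memory REGISTRY dict, the two toy templates, and the call_llm stand-in are hypothetical and exist only to show the three steps without a network call.

```python
import re

# Hypothetical in-memory "registry": name -> label -> template.
REGISTRY = {
    "classifier": {
        "production": "Classify this email: {{ email_body }}",
        "v1": "Label the email below.\n{{ email_body }}",
    }
}

_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")


def fetch(name: str, label: str = "production") -> str:
    """Step 1: fetch a string by name (and label)."""
    return REGISTRY[name][label]


def render(template: str, **variables: str) -> str:
    """Step 2: fill {{ var }} placeholders; a missing variable raises KeyError."""
    return _VAR.sub(lambda m: str(variables[m.group(1)]), template)


def call_llm(prompt: str) -> str:
    """Step 3: stand-in for the real LLM client call (no network here)."""
    return f"<model reply to a {len(prompt)}-character prompt>"


if __name__ == "__main__":
    prompt = render(fetch("classifier"), email_body="My card was charged twice")
    print(prompt)
    print(call_llm(prompt))
```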
Three honest paths
There are three reasonable ways to ship prompt versioning in 2026, and the right pick is mostly about team shape and workflow — not technology. We'll walk through all three on the same customer-support classifier.
- The SaaS path: Langfuse Cloud. Polished UI, hosted tracing, A/B testing, free tier good enough to ship on. Right when non-engineers (PM, support lead, prompt engineer) need to edit prompts daily without a deploy.
- The minimum path: prompts.yaml + git. One YAML file, a three-line regex renderer, a JSONL audit file. Right when the same people who write the prompts also ship the code.
- The middle path: Postgres + an ~80-line FastAPI router. Real registry semantics (hot-swap, A/B labels, version history, audit table) on infrastructure you already run. Right when YAML's redeploy-to-change cost is too high but a third-party SaaS is overkill (data-residency rules, cost, vendor lock-in).
We'll start with Langfuse so you see what registry semantics actually feel like end-to-end, then strip down to YAML to show the napkin version, then sketch the middle Postgres path — the one that matches Langfuse's shape on code you own.
Install
For the registry version we'll use Langfuse Cloud, which has a genuinely free tier — sign up at langfuse.com, create a project, and grab a public key + secret key. Self-hosting is one docker-compose up and a brief note at the end of this post covers it.
For the LLM call we'll use Anthropic's Claude Haiku 4.5, which is fast and cheap (~$1 per million input tokens). A credit card is required, but the cost of running every example in this post is well under one cent.
uv venv && source .venv/bin/activate
uv pip install anthropic "langfuse>=3" pyyaml
Set four env vars:
export ANTHROPIC_API_KEY="sk-ant-..."
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_HOST="https://cloud.langfuse.com" # or your self-hosted URL
All the code in this post is paste-ready — copy each block into a file with the name above it and you're set.
The SaaS path — Langfuse Cloud (your first versioned prompt)
The job: a customer-support classifier that reads an email and labels it as billing, bug_report, feature_request, or general_question. We'll ship two prompt versions: a bare v1, then a v2 with label definitions and a couple of calibrating examples. v2 is the active production prompt; v1 stays in the registry as a fallback and a comparison target.
Here's the entire thing with Langfuse, saved as classifier_langfuse.py:
import json
from anthropic import Anthropic
from langfuse import get_client, observe
langfuse = get_client()
V1 = """You are a customer support triage system. Classify this email into exactly one of:
billing, bug_report, feature_request, general_question.
Email subject: {{ email_subject }}
Email body: {{ email_body }}
Respond with JSON only: {"label": "...", "reason": "..."}"""
V2 = """You are a customer support triage system. Classify the email into exactly one of these labels:
- billing — payments, invoices, refunds, subscription, card-charge issues
- bug_report — something is broken: errors, crashes, features that do not behave as documented
- feature_request — the user is asking for new functionality that does not exist yet
- general_question — anything else (how-to, pricing question, partnership)
Examples:
- "My card was charged twice" → billing
- "The export button does nothing" → bug_report
- "Could you add CSV import?" → feature_request
Email subject: {{ email_subject }}
Email body: {{ email_body }}
Respond with JSON only: {"label": "...", "reason": "<one sentence>"}"""
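# One-time: upload both versions to the registry.
# Comment these two calls out after the first run: each create_prompt call adds a
# new version of "classifier" and moves the given label onto it.
langfuse.create_prompt(name="classifier", type="text", prompt=V1, labels=["v1"])
langfuse.create_prompt(name="classifier", type="text", prompt=V2, labels=["production"])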

@observe(as_type="generation")
def classify(email_subject: str, email_body: str, label: str = "production") -> dict:
    prompt = langfuse.get_prompt("classifier", label=label)
    rendered = prompt.compile(email_subject=email_subject, email_body=email_body)
    msg = Anthropic().messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[
            {"role": "user", "content": rendered},
            {"role": "assistant", "content": "{"},  # prefill: force JSON, skip prose
        ],
    )
    result = json.loads("{" + msg.content[0].text)
    # Attach model + token counts so Langfuse can compute cost and link the prompt.
    langfuse.update_current_generation(
        input={"email_subject": email_subject},
        output=result,
        model="claude-haiku-4-5",
        usage_details={"input": msg.usage.input_tokens, "output": msg.usage.output_tokens},
        prompt=prompt,
        metadata={"prompt_label": label},
    )
    return result


if __name__ == "__main__":
    out = classify(
        email_subject="Refund for double charge",
        email_body="My card was charged twice last week for the same order. Could you refund one?",
    )
    print(json.dumps(out, indent=2))
    langfuse.flush()
Run it:
python classifier_langfuse.py
Output:
{
"label": "billing",
"reason": "The customer is reporting a duplicate charge and requesting a refund, which are billing issues."
}
Open the Langfuse dashboard, and you'll find a trace with the full input, the rendered prompt, the model's response, latency, token count, and a link back to the exact prompt version that ran. That trace is the value proposition.
What just happened, line by line
from langfuse import get_client, observe
Two things from the SDK. get_client() returns a configured Langfuse client (it reads the three LANGFUSE_* env vars you set above). observe is a decorator that wraps any function in a trace: every call becomes a row in the Langfuse dashboard with timing, inputs, outputs, and a link to the prompt that ran.
V1 = "..." / V2 = "..."
Two prompt templates, written as Python strings. The {{ email_subject }} and {{ email_body }} placeholders are Langfuse's compilation syntax — Mustache-style double braces, no Jinja, no f-strings. The framework will substitute them at runtime.
langfuse.create_prompt(name="classifier", type="text", prompt=V1, labels=["v1"])
Upload the v1 template to the registry. name is the human-readable identifier you'll use to fetch it. type="text" means a single string (the alternative is "chat", which takes a list of {role, content} messages). labels=["v1"] attaches a label — labels are how you fetch a specific version later. Run this once and the prompt now lives on Langfuse, version-numbered automatically (this is version 1 of classifier).
langfuse.create_prompt(..., prompt=V2, labels=["production"])
Upload v2 with the production label. This is the one your app fetches by default. Now you have two versions in the registry — v1 with the label v1, v2 with the label production. They live alongside each other; either can be promoted or demoted by editing labels in the Langfuse UI without redeploying a single line of code.
@observe(as_type="generation")
Wrap classify so each call shows up on the Langfuse dashboard. The as_type="generation" argument tells Langfuse the wrapped function is an LLM call (not a generic Python span) — that's what enables model name, token counts, and cost computation on the trace. With a bare @observe(), Langfuse only sees "a function ran in 1.5s and returned this dict" and the trace row's cost column stays at zero.
prompt = langfuse.get_prompt("classifier", label=label)
Fetch the prompt by name and label. Default label is "production". The SDK caches the result for sixty seconds locally, so this is one HTTP round-trip per minute per process — not per request. Change the label assignment in the Langfuse UI and the next minute's traffic picks it up. That's the hot-swap mechanic.
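The caching behavior described above can be illustrated with a small time-to-live cache. This is not Langfuse's implementation, just a sketch of the idea: the class name TTLPromptCache, the fetch callback, and the injectable clock are assumptions made so the example runs without a server.

```python
import time
from typing import Callable


class TTLPromptCache:
    """Cache fetched templates per (name, label) for ttl_seconds."""

    def __init__(
        self,
        fetch: Callable[[str, str], str],
        ttl_seconds: float = 60.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries: dict[tuple[str, str], tuple[float, str]] = {}

    def get(self, name: str, label: str = "production") -> str:
        key = (name, label)
        now = self._clock()
        hit = self._entries.get(key)
        if hit is not None and now - hit[0] < self._ttl:
            return hit[1]  # still fresh: no round-trip
        template = self._fetch(name, label)  # missing or expired: refetch
        self._entries[key] = (now, template)
        return template


if __name__ == "__main__":
    cache = TTLPromptCache(lambda name, label: f"template for {name}/{label}")
    print(cache.get("classifier"))  # first call goes to the fetch callback
    print(cache.get("classifier"))  # second call is served from the cache
```

A label change on the server becomes visible only after the cached entry expires, which is why a hot-swap takes up to one TTL to reach every process.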
rendered = prompt.compile(email_subject=..., email_body=...)
Substitute variables into the template. compile returns a plain string with the {{ var }} placeholders filled in. The values are not stored, only the template is.
Anthropic().messages.create(model="claude-haiku-4-5", ...)
A standard Anthropic SDK call. The registry doesn't replace your LLM client — it lives one layer above it. You can swap providers (Claude → GPT-4 → Gemini) without changing the prompt, and you can swap the prompt without changing the model.
The {"role": "assistant", "content": "{"} line — prefill.
Even when a prompt says "respond with JSON only", Claude sometimes wraps the response in markdown fences (```json ... ```) or leads with a sentence of prose. Both break json.loads. The fix is assistant prefill: append an extra message with role: "assistant" and content: "{", and the model treats { as the start of its own reply — it continues with the rest of the JSON object and nothing else. You prepend the { back when parsing. This is a canonical Anthropic technique, covered in the prompting best-practices post, and it's how you get reliable structured output without a tool-use schema.
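The parse step of the prefill trick can be made a little more forgiving. The sketch below assumes the reply might occasionally carry trailing text after the closing brace; parse_prefilled is a hypothetical helper name, and it uses only the standard library's json.JSONDecoder.raw_decode.

```python
import json


def parse_prefilled(completion_text: str) -> dict:
    """Rebuild the JSON object from a reply that was prefilled with "{".

    raw_decode stops at the end of the first complete JSON value, so any
    text after the closing brace is ignored instead of raising.
    """
    obj, _end = json.JSONDecoder().raw_decode("{" + completion_text)
    return obj


if __name__ == "__main__":
    # The model's text starts after the prefilled "{".
    reply = '"label": "billing", "reason": "Duplicate charge."}\n\nLet me know if you need more.'
    print(parse_prefilled(reply))
```

A reply that never closes the object (for example, one cut off by max_tokens) still raises json.JSONDecodeError, which is the behavior you want to see in logs.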
langfuse.update_current_generation(model=..., usage_details=..., prompt=prompt, ...)
Attach the post-call telemetry to the active generation observation. model="claude-haiku-4-5" is what Langfuse looks up in its price catalog to compute cost. usage_details={"input": N, "output": M} is the token count Anthropic returned (msg.usage.input_tokens / msg.usage.output_tokens). prompt=prompt is the actual TextPromptClient we fetched earlier. Passing it back is what populates the Linked Generations tab inside the prompt's page: open classifier v2 in the Prompts UI, click Linked Generations, and you see every trace that ran that exact version. This is the registry-only feature your audit.log can't match without a UI: the bridge from "prompt #2 looks suspicious" to "show me every call it produced." If your model isn't in Langfuse's catalog yet (common for newly released ones), add it under Project Settings → Models, or pass the computed USD cost directly, for example cost_details={"input": msg.usage.input_tokens * 0.000001, "output": msg.usage.output_tokens * 0.000005}.
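The cost arithmetic behind that trace column is small enough to check by hand. The sketch below uses the per-token prices quoted in this post ($1 per million input tokens, $5 per million output tokens); the helper name and the token counts in the example are illustrative assumptions, so check current pricing before relying on the numbers.

```python
# Assumed per-token prices in USD, taken from the figures quoted in this post.
INPUT_USD_PER_TOKEN = 0.000001   # $1 per million input tokens
OUTPUT_USD_PER_TOKEN = 0.000005  # $5 per million output tokens


def cost_details(input_tokens: int, output_tokens: int) -> dict[str, float]:
    """Return the USD cost of one call, split by input and output."""
    return {
        "input": input_tokens * INPUT_USD_PER_TOKEN,
        "output": output_tokens * OUTPUT_USD_PER_TOKEN,
    }


if __name__ == "__main__":
    # Illustrative token counts for one classification call.
    details = cost_details(input_tokens=300, output_tokens=60)
    print(details, "total:", sum(details.values()))
```

With those illustrative counts, input and output each come to about $0.0003, so one call costs roughly six hundredths of a cent.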
langfuse.flush()
Tell the SDK to push any pending traces to the server before the process exits. Without this, a short-lived script can finish before its traces are sent.
That's the whole program. Four calls — create_prompt, get_prompt, compile, update_current_generation — plus the @observe(as_type="generation") decorator. Everything else is the same code you'd write without Langfuse.
The minimum path — prompts.yaml + git
The fastest way to understand what Langfuse just did is to write the same classifier without it — the napkin version, all in your repo. Same Anthropic SDK call, same prompt content, same output. Only the prompt management is different.
Two files. First, prompts.yaml:
classifier:
  active: v2
  versions:
    v1:
      created: 2026-05-20
      note: "First pass. No examples, no label definitions."
      template: |
        You are a customer support triage system. Classify this email into exactly one of:
        billing, bug_report, feature_request, general_question.
        Email subject: {{ email_subject }}
        Email body: {{ email_body }}
        Respond with JSON only: {"label": "...", "reason": "..."}
    v2:
      created: 2026-05-27
      note: "Added label definitions and three calibrating examples."
      template: |
        You are a customer support triage system. Classify the email into exactly one of these labels:
        - billing — payments, invoices, refunds, subscription, card-charge issues
        - bug_report — something is broken: errors, crashes, features that do not behave as documented
        - feature_request — the user is asking for new functionality that does not exist yet
        - general_question — anything else (how-to, pricing question, partnership)
        Examples:
        - "My card was charged twice" → billing
        - "The export button does nothing" → bug_report
        - "Could you add CSV import?" → feature_request
        Email subject: {{ email_subject }}
        Email body: {{ email_body }}
        Respond with JSON only: {"label": "...", "reason": "<one sentence>"}
Then classifier_lean.py:
import json, re, subprocess, sys, time
from pathlib import Path
import yaml
from anthropic import Anthropic
PROMPTS = Path(__file__).parent / "prompts.yaml"
AUDIT = Path(__file__).parent / "audit.log"
_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")

def render(template: str, **vars) -> str:
    return _VAR.sub(lambda m: str(vars[m.group(1)]), template)


def load_prompt(name: str, version: str | None = None) -> tuple[str, str]:
    block = yaml.safe_load(PROMPTS.read_text())[name]
    chosen = version or block["active"]
    return block["versions"][chosen]["template"], chosen


def git_sha() -> str:
    try:
        return subprocess.check_output(
            ["git", "rev-parse", "--short", "HEAD"], text=True, stderr=subprocess.DEVNULL
        ).strip()
    except Exception:
        return "untracked"


def classify(email_subject: str, email_body: str, version: str | None = None) -> dict:
    template, vlabel = load_prompt("classifier", version)
    rendered = render(template, email_subject=email_subject, email_body=email_body)
    msg = Anthropic().messages.create(
        model="claude-haiku-4-5",
        max_tokens=200,
        messages=[
            {"role": "user", "content": rendered},
            {"role": "assistant", "content": "{"},  # prefill: force JSON, skip prose
        ],
    )
    result = json.loads("{" + msg.content[0].text)
    with AUDIT.open("a") as f:
        f.write(json.dumps({
            "ts": time.time(), "git_sha": git_sha(), "prompt_version": vlabel,
            "input_subject": email_subject, "output": result,
        }) + "\n")
    return result


if __name__ == "__main__":
    out = classify(
        email_subject="Refund for double charge",
        email_body="My card was charged twice last week for the same order. Could you refund one?",
        version=sys.argv[1] if len(sys.argv) > 1 else None,
    )
    print(json.dumps(out, indent=2))
Run it:
python classifier_lean.py # uses v2 (active in prompts.yaml)
python classifier_lean.py v1 # pin to v1 for a comparison run
Same JSON output. The Anthropic call is identical. The only difference is where the template came from — a YAML file in your repo instead of an HTTP call to Langfuse — and where the trace went — a local audit.log JSONL instead of a hosted dashboard.
About sixty lines for the lean version, seventy-five for the Langfuse one. Surprising? The registry version is actually slightly bigger, because the cost/usage telemetry has to be wired up explicitly — Langfuse doesn't auto-intercept the Anthropic SDK call. The "SaaS = less code" intuition is wrong here. What you're really trading is the kind of code you write — four chores move from your codebase to the platform:
- Variable rendering. prompt.compile(**vars) substituted {{ email_subject }}. The lean version did the same job with a three-line regex helper.
- The version index. Langfuse keeps a server-side record of every version with auto-increment numbers and a labels table. The lean version puts that record in YAML and lets git keep the history.
- The trace. @observe and the prompt=prompt linkage produced a row on the Langfuse dashboard with model, latency, tokens, prompt version, all stitched together. The lean version writes a JSONL line; you can jq it, but there's no UI.
- Hot-swap. Promoting v1 → production in Langfuse is a click in the UI; the next minute's traffic picks it up. In the lean version, you git commit the change to active: v1 in prompts.yaml and redeploy (or trigger a config reload, if your app watches the file).
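Without a dashboard, the JSONL audit file can still answer "which version produced which labels" with a few lines of Python. This is a sketch that assumes the row shape written by classifier_lean.py (prompt_version plus an output object with a label key); the function name calls_by_version and the sample rows are made up for illustration.

```python
import json
from collections import Counter
from typing import Iterable


def calls_by_version(lines: Iterable[str]) -> Counter:
    """Count (prompt_version, label) pairs across JSONL audit lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        row = json.loads(line)
        counts[(row["prompt_version"], row["output"]["label"])] += 1
    return counts


if __name__ == "__main__":
    # Made-up sample rows in the same shape as audit.log.
    sample = [
        '{"ts": 1.0, "git_sha": "abc1234", "prompt_version": "v2", '
        '"input_subject": "Refund", "output": {"label": "billing", "reason": "..."}}',
        '{"ts": 2.0, "git_sha": "abc1234", "prompt_version": "v1", '
        '"input_subject": "Export", "output": {"label": "bug_report", "reason": "..."}}',
    ]
    for (version, label), n in sorted(calls_by_version(sample).items()):
        print(version, label, n)
```

To run it against the real file, pass an open file handle, for example calls_by_version(open("audit.log")).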
The middle path — Postgres + an 80-line FastAPI router
YAML + git is enough for a three-engineer team. Langfuse Cloud is enough for a thirty-person team with a PM editing prompts daily. The gap between them is where most LLM-shipping teams actually live: eight engineers, one product person who wants to A/B-test a prompt without filing a ticket, and an existing Postgres you already run for the app. That gap is a problem Langfuse solves and YAML doesn't — and it's a problem you can solve in about eighty lines of FastAPI and a two-table Postgres schema, with your prompt data sitting on your own database instead of a third party's.
This section is the shape, not the full code. The runnable router + migration + Adminer recipe lives in the YouTube companion for this post — link going live with the video. What you see below is the load-bearing parts: the schema, the call signatures, and the audit query that produces Langfuse's Linked Generations view in one line of SQL.
The schema — two tables
CREATE TABLE prompts (
    name        TEXT,
    version     INT,
    template    TEXT NOT NULL,
    labels      TEXT[] DEFAULT '{}',  -- {'production'} or {'v1','deprecated'}
    note        TEXT,
    created_at  TIMESTAMPTZ DEFAULT now(),
    PRIMARY KEY (name, version)
);
CREATE INDEX prompts_labels_idx ON prompts USING GIN (labels);

CREATE TABLE prompt_calls (
    id              BIGSERIAL PRIMARY KEY,
    ts              TIMESTAMPTZ DEFAULT now(),
    prompt_name     TEXT NOT NULL,
    prompt_version  INT NOT NULL,
    model           TEXT NOT NULL,
    input_tokens    INT,
    output_tokens   INT,
    output          JSONB,
    FOREIGN KEY (prompt_name, prompt_version) REFERENCES prompts(name, version)
);
Two tables do the entire job. prompts is the registry: one row per (name, version) pair, with a labels array ({'production'}, {'v1','deprecated'}, {'staging'}). The GIN index serves "fetch by label" as an array-containment lookup (labels @> ARRAY['production']); a plain 'production' = ANY(labels) predicate would not use it. prompt_calls is the audit log: one row per LLM call, linked back to the exact (name, version) that produced it.
The router — four endpoints
@app.post("/prompts/{name}", status_code=201)
def create_version(name: str, body: NewPrompt) -> Prompt:
    """Insert a new version of `name`. version = COALESCE(max(version), 0) + 1."""


@app.get("/prompts/{name}")
def get_prompt(name: str, label: str = "production") -> Prompt:
    """SELECT * FROM prompts WHERE name=$1 AND labels @> ARRAY[$2]::text[] ORDER BY version DESC LIMIT 1"""


@app.post("/prompts/{name}/promote")
def promote(name: str, version: int, label: str) -> None:
    """Atomically move `label` from whatever version currently holds it onto `version`."""


@app.post("/calls")
def log_call(body: PromptCall) -> None:
    """INSERT INTO prompt_calls (...). Called from your app after each model call."""
promote is the hot-swap mechanic. Two UPDATEs in one transaction:
UPDATE prompts SET labels = array_remove(labels, $1) WHERE name = $2;
UPDATE prompts SET labels = array_append(labels, $1) WHERE name = $2 AND version = $3;
Run that and the next get_prompt(label='production') returns the new version. No deploy. No file watcher.
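To try the label-move semantics without a Postgres instance, here is a runnable sketch on SQLite. It is deliberately not the schema above: SQLite has no array columns, so the labels live in a separate, hypothetical prompt_labels table keyed on (name, label), and the move is one INSERT OR REPLACE instead of two UPDATEs. Table and function names are assumptions for the example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE prompts (
    name TEXT, version INTEGER, template TEXT NOT NULL,
    PRIMARY KEY (name, version)
);
-- One row per (name, label): a label can point at only one version at a time.
CREATE TABLE prompt_labels (
    name TEXT, label TEXT, version INTEGER NOT NULL,
    PRIMARY KEY (name, label),
    FOREIGN KEY (name, version) REFERENCES prompts(name, version)
);
""")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite leaves FK checks off by default


def promote(name: str, version: int, label: str) -> None:
    """Move `label` onto `version` inside a single transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO prompt_labels (name, label, version) VALUES (?, ?, ?)",
            (name, label, version),
        )


def get_prompt(name: str, label: str = "production") -> str:
    """Return the template of the version currently holding `label`."""
    row = conn.execute(
        "SELECT p.template FROM prompts p JOIN prompt_labels l "
        "ON l.name = p.name AND l.version = p.version "
        "WHERE l.name = ? AND l.label = ?",
        (name, label),
    ).fetchone()
    if row is None:
        raise KeyError(f"no prompt {name!r} with label {label!r}")
    return row[0]


if __name__ == "__main__":
    with conn:
        conn.execute("INSERT INTO prompts VALUES ('classifier', 1, 'v1 template')")
        conn.execute("INSERT INTO prompts VALUES ('classifier', 2, 'v2 template')")
    promote("classifier", 2, "production")
    print(get_prompt("classifier"))
    promote("classifier", 1, "production")  # roll back without a deploy
    print(get_prompt("classifier"))
```

The separate labels table trades the array column for a uniqueness guarantee enforced by the primary key; whether that fits better than the array design depends on how many labels you expect per version.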
Editing without writing an admin UI
You don't need to ship a custom admin page for non-engineers. Adminer or PgWeb pointed at the same database is two minutes of setup and gives a product manager everything they need: list view of prompts, edit form for the template, atomic label updates by ticking a row. It's not as polished as Langfuse's UI — it's a Postgres admin tool — but the workflow (PM edits the prompt, hits save, next request picks it up) is identical.
The bridge column — Linked Generations in one SQL query
The trace bridge — "which calls ran with which prompt version" — is the part Langfuse's Linked Generations tab is selling. With this schema, it's a query:
SELECT ts, output_tokens, output->>'label' AS classification
FROM prompt_calls
WHERE prompt_name = 'classifier' AND prompt_version = 2
ORDER BY ts DESC
LIMIT 100;
Drop that into Adminer, save it as a bookmarked query, and you have the same answer Langfuse gives — "every call this prompt version produced" — without leaving your own infrastructure.
What you're trading
The 80 lines aren't free. You give up Langfuse's polished diff viewer, its built-in eval framework, the hosted dashboard with latency-percentile charts, and the Linked Generations tab as a clickable thing. What you get back is full control over the data, zero external dependency, and a registry whose entire footprint is two tables in a database you already maintain. For teams under data-residency rules ("the prompts are in the EU, full stop") or whose CFO asks pointed questions about another SaaS invoice, that tradeoff is the point.
The full FastAPI router, the Pydantic models, the migration, and the Adminer screenshots are in the YouTube walkthrough — coming next week.
So when is each one right?
- Use Langfuse (or any hosted registry) when non-engineers need to edit the live prompt without a deploy and you want the polished dashboard / eval / Linked Generations UI as a click instead of a SQL query; you're already paying for hosted tracing and the registry is bundled; or your audit/compliance requirements demand a centralized log outside your app's database.
- Use Postgres + the FastAPI router when you need real registry semantics (hot-swap, A/B labels, audit) but the data has to stay on your infrastructure (EU residency, regulated industry, big-customer contracts); you already run Postgres for the app; or you don't want another vendor in your bill.
- Use prompts.yaml + git when the same engineers who write the prompts also ship the code; your traffic is small enough that the JSONL audit log is searchable with jq; you'd rather see prompt changes in code review (alongside the call-site changes that depend on them) than in a separate UI; or your stack is allergic to one more dependency.
None of these is "better" in the abstract. All three produce the same classification. They differ in which workflow they make cheap. Pick the one whose workflow fits your team's actual prompt-iteration cadence.
The single signal that tells you to upgrade from YAML to a registry is when a non-engineer is in your Slack DMs asking "can you tweak the prompt real quick." If that's happening weekly, you want a registry. Whether it's Langfuse or your own 80 lines of FastAPI is a separate question — answered by where the data needs to live and what your appetite for another SaaS bill is.
The three things a prompt registry actually gives you
Strip away the marketing pages and every registry — Langfuse, PromptLayer, Promptfoo, BrainTrust, Helicone Prompts, Weave — sells you the same three primitives:
Named, versioned strings. Fetch by name plus a version selector (label or version number). Auto-increment versions. A labels table that maps "production", "staging", "experiment" to specific versions, editable from a UI.
Variable compilation. Render {{ var }} placeholders without your app code doing the substitution. Always Mustache-style or Jinja-lite; no one invents a new templating language for this.
A trace link. Every model call carries a back-reference to the exact prompt version that ran. That link is the bridge from "the model said something weird in production" to "here's the input, the prompt, the output, the version."
Almost every advanced feature you'll read about — A/B routing, fallback chains, prompt evals, drift detection, cost analytics — is a layer built on top of those three. The YAML path covers the first two natively (the third needs jq plus discipline). The Postgres path covers all three with a foreign key. Langfuse covers all three with a UI. The technology choice is which workflow you want, not whether the primitives exist.
Why people reach for Langfuse
Four reasons explain most of the adoption.
Hot-swap without redeploy. Promoting a new prompt version to production is a click in the UI, and the next request picks it up (modulo the client cache TTL, default sixty seconds). For teams where prompt iteration is a daily activity and a deploy cycle is fifteen minutes, this is the single largest workflow win.
Tracing and eval in the same product. Langfuse ships prompt management, trace storage, and an eval framework as one platform. Every @observe-wrapped call shows up on a dashboard with token cost, latency, and a button to mark the output as good/bad for offline eval. If you already want hosted tracing, the registry is a free side-effect.
Open source plus a managed cloud. Langfuse is MIT-licensed and self-hostable on a single docker-compose up — you can run it on the same VM as your app and have zero external dependency. Most other registries are SaaS-only. For Swiss/EU teams with data-residency rules, self-hosting matters.
Non-engineer workflow. The UI is good enough that a product manager, support lead, or prompt engineer can edit prompts, view diffs, and promote versions without touching code or filing a PR. That's the headline feature that separates a registry from a YAML file — and the only one that actually matters for some teams.
What it isn't
Langfuse (and every registry I named above) is not a substitute for offline evals. The dashboard will show you that v2 has a higher thumbs-up rate than v1 on the last thousand production calls, but it won't tell you what would have happened if you'd shipped v3 on the same inputs. Offline eval — a fixed input set, a judge model, regression detection in CI — is a different job. Langfuse has an eval feature, and some teams use it, but the prompt registry and the eval framework are separable concerns and most teams end up using two tools.
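A bare-bones version of that offline eval job, as a sketch: a fixed input set, an accuracy score, and a regression check that could run in CI. It swaps the judge model for exact label matching to stay self-contained, and the case list, function names, and stub classifier are all assumptions for illustration; in practice you would pass in the real classify function.

```python
from typing import Callable

# Hypothetical fixed eval set: (subject, body, expected label).
CASES = [
    ("Refund for double charge", "My card was charged twice.", "billing"),
    ("Export broken", "The export button does nothing.", "bug_report"),
    ("CSV import", "Could you add CSV import?", "feature_request"),
    ("Pricing", "Do you offer annual plans?", "general_question"),
]


def accuracy(classify: Callable[[str, str], dict], cases=CASES) -> float:
    """Fraction of cases where classify(...)["label"] equals the expected label."""
    hits = sum(
        classify(subject, body)["label"] == expected
        for subject, body, expected in cases
    )
    return hits / len(cases)


def check_no_regression(candidate: float, baseline: float, tolerance: float = 0.0) -> None:
    """Raise if the candidate prompt scores below the baseline (minus tolerance)."""
    if candidate < baseline - tolerance:
        raise AssertionError(f"regression: {candidate:.2f} < {baseline:.2f}")


if __name__ == "__main__":
    def stub_classifier(subject: str, body: str) -> dict:
        # Stand-in for a real model call: labels everything as billing.
        return {"label": "billing", "reason": "stub"}

    score = accuracy(stub_classifier)
    print(f"accuracy: {score:.2f}")
```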
It is also not a templating engine in the Jinja sense. Compilation is single-pass variable substitution. No loops, no conditionals, no inheritance. If you need to render a list of N items into a prompt, you do that in Python before calling compile, not in the template itself. That's a deliberate constraint — it keeps the template visible to non-engineers without forcing them to learn a programming language.
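Rendering a list in Python before substitution looks like this in practice. The sketch uses a regex stand-in for the registry's compile step so it runs offline; compile_template, bullet_list, the recent_tickets variable, and the template text are hypothetical names chosen for the example.

```python
import re

_VAR = re.compile(r"\{\{\s*(\w+)\s*\}\}")

# Hypothetical template: the list arrives as one pre-rendered string variable.
TEMPLATE = (
    "Recent tickets from this customer:\n"
    "{{ recent_tickets }}\n\n"
    "Email body: {{ email_body }}"
)


def compile_template(template: str, **variables: str) -> str:
    """Stand-in for single-pass substitution: no loops, no conditionals."""
    return _VAR.sub(lambda m: str(variables[m.group(1)]), template)


def bullet_list(items: list[str]) -> str:
    """Do the 'loop' in Python: one '- item' line per entry."""
    return "\n".join(f"- {item}" for item in items)


if __name__ == "__main__":
    prompt = compile_template(
        TEMPLATE,
        recent_tickets=bullet_list(["Card charged twice", "Invoice missing VAT"]),
        email_body="Still waiting on my refund.",
    )
    print(prompt)
```

The template stays readable to a non-engineer (two placeholders, no control flow), while the formatting logic lives in code review where it belongs.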
And it is not secret-safe out of the box. Your prompts probably contain product names, internal terminology, and sometimes pieces of customer data in the examples. Treat the registry's access controls the same way you'd treat any production database. Langfuse Cloud offers an EU data region; check exactly where your project is hosted before you upload anything sensitive.
Other tools you'll hear about
The registry space is crowded. A quick map of who does what:
- PromptLayer. Closed-source SaaS, the original prompt-versioning product. Strong A/B testing, good UI. Pricing: free tier capped at 1,000 logged calls/month, paid plans from $50/month.
- Promptfoo. Open-source (MIT), CLI-first. Strongest tool for eval, weaker on registry — most teams pair it with something else. Free.
- BrainTrust. Closed-source SaaS, enterprise-flavored. Heavy on eval, log analytics, and human-in-the-loop review. Pricing on request.
- Helicone Prompts. Open-source (Apache 2.0), bundled with Helicone's LLM proxy. Good fit if you're already using Helicone for cost tracking.
- Weave (W&B). Closed-source, part of Weights & Biases. Strongest if your team already uses W&B for ML experiment tracking; otherwise it's a heavy lift.
- PromptHub, Agenta, OpenPrompt. Smaller players. Worth a look if Langfuse doesn't fit and you don't want to roll your own.
If you want the closest 1:1 alternative to Langfuse with a different operational shape, Helicone is the one to evaluate next. If you want to stop using a registry, the prompts.yaml pattern above is the one to compare against.
What to try next
Four concrete next steps in order of effort:
- Run the YAML and Langfuse versions yourself. Paste the two scripts and the YAML file above into a folder, set the env vars, run python classifier_langfuse.py and python classifier_lean.py. Diff the output. Look at the Langfuse trace next to the local audit.log. The right choice will become obvious within ten minutes.
- Promote v1 to production from the Langfuse UI without touching code. Then re-run classifier_langfuse.py with the two create_prompt calls commented out (left in, they would upload a fresh version and move the production label back onto it). The new trace links to v1, and the reason text may differ. That's the hot-swap value proposition in twenty seconds.
- Sketch the Postgres path on paper. Take the two-table schema above. Picture which of your prompts you'd promote first. Picture which non-engineer you'd give Adminer access to. If you can name them, you have the problem the middle path solves, and the full router walkthrough is on its way as a YouTube companion to this post.
- Self-host Langfuse. It is genuinely one docker-compose up. The project ships a compose file that runs the app, Postgres, and ClickHouse. Point LANGFUSE_HOST at http://localhost:3000 and the SDK doesn't know the difference. This is the path for EU teams with data-residency concerns who do want a polished UI but not a third-party SaaS bill.
What's next on this blog
This is the second post in the Lean AI Stack series — replace one named piece of the AI-builder SaaS stack with code you own. The first was Build Your Own Model Registry in a Weekend — FastAPI + Postgres, No MLflow. The next two on the runway:
- LLM Gateway in 100 Lines — Drop Helicone and Portkey. A FastAPI proxy with cost logging, retry, model routing, and prompt-cache headers. Replaces the LLM-observability SaaS stack for teams that don't need a UI.
- Replace Mem0 with pgvector — Agent Memory in 60 Lines. Long-term conversation memory with vector search and Postgres recency weighting. Why managed-memory SaaS is usually overkill for production agents.
If you found this one useful, the YouTube channel has the 5-minute explainer cuts and the Telegram channel gets the post the moment it's live.