Stop Vibe-Checking Your Prompts - Building an Eval Harness That Catches Regressions Before Your Users Do

Most LLM features ship on vibes. The first time you regret it is the day a prompt change quietly breaks half your traffic and nobody notices for a week. Here is what an honest eval harness actually contains - golden datasets, LLM-as-judge, prompt versioning - and the real cost of running it.

Banner

A team ships a prompt change on a Tuesday. The change improves one flow and silently breaks three others.

The model returns valid JSON. The structure looks right. Someone on the team eyeballs a couple of outputs, nods, and merges. The deploy goes out clean. No alerts fire.

The first user complaint arrives on Friday. By the time anyone traces it back to the Tuesday commit, roughly six hundred customer queries have received responses with a hallucinated field that downstream code happily consumed. Nothing was wrong, exactly. The system had just been quietly less correct.

This is the failure mode of vibe-checking prompts: it doesn't fail loudly, and the bugs that ship are the kind nobody notices until production has been wrong for a week. The fix is not more careful eyeballing. The fix is an eval harness — a small, opinionated piece of infrastructure that runs the same set of inputs through every prompt change and reports, deterministically, whether something broke.

This post is about what a useful eval harness actually contains. Not the frameworks; most of them are overkill. Not the consultant-friendly version with a vendor logo on it. The minimum viable shape: a golden dataset, a judge, a runner, and a report — and the honest tradeoffs around what each one costs and where each one lies to you.


What "Eval Harness" Actually Means

The phrase gets used loosely. Three different things tend to live under the same umbrella, and conflating them is how teams end up spending three months building the wrong one.

1. Prompt logging. Capturing every input, output, latency, and token count from production. Useful, cheap, and not an eval harness. Logging tells you what happened. It does not tell you whether what happened was correct.

2. Unit tests for prompts. Small, fast assertions: "the JSON parses," "the response contains an intent field," "no PII leaks through." These catch egregious regressions but say nothing about quality. A response can pass every structural test and still be subtly wrong on content.

3. An eval harness. A reproducible run over a curated set of representative inputs, where each output is graded — by code, by another model, or by a human — against a quality bar. The output is a comparable score across prompt versions, model versions, and configuration changes. This is what catches the silent-regression class of bug.

The four moving parts:

PartWhat it doesCommon mistake
Golden datasetA pinned set of representative inputs with expected outputs (or rubrics)Treating it as static — datasets need to grow as production surfaces new failure modes
RunnerExecutes the system-under-test against every input, captures outputs and metadataCoupling it to a specific framework instead of treating it as a library
JudgeDecides whether each output passes — code, LLM-as-judge, or humanTrusting LLM-as-judge on tasks where it can't actually tell
ReportA diff-able summary of pass/fail/score across runsProducing pretty dashboards but no comparable numbers

The harness is the thing that ties these four parts into a single, repeatable command. If running the eval requires more than one step, it will not be run. That is the most important property to optimize for.


The Golden Dataset — Minimum Viable Version

Most teams stall on the eval harness because they imagine the golden dataset has to be large, exhaustively labeled, and statistically representative before it's worth running. None of those things are true at the start.

A useful first version of a golden dataset is somewhere between 20 and 50 carefully chosen examples. Each example covers a real failure mode, an edge case, or a flagship happy-path query. The selection bias here is a feature, not a bug: the goal is to surface regressions, and regressions tend to cluster around the cases the team already knows are tricky.

Where to source those examples:

  • Production traces. The fastest way. Pull the logs from the last month. Find the queries that already caused incidents, generated complaints, or got escalated. Those are the cases worth pinning.
  • Adversarial brainstorm. Spend an afternoon writing twenty inputs designed to break the system on purpose: ambiguous phrasing, contradictory constraints, languages the prompt didn't anticipate, inputs that look like normal text but contain prompt injection.
  • Boundary cases. For every constraint the prompt asserts ("respond in JSON," "use ≤3 sentences," "decline if the user is asking for medical advice"), include at least one input that tests the boundary on each side.

The expected output for each case can take three forms, in increasing order of looseness:

  1. Exact match. Useful for structural assertions ("returns valid JSON with these keys"), classifications, and intent labels. Cheap to grade, but only covers a narrow slice of LLM tasks.
  2. Rubric. A short list of properties the output must satisfy ("mentions the user's account tier," "does not give legal advice," "tone is neutral"). Graded by an LLM judge or a human.
  3. Reference output. A canonical "good" response. Used as input to a judge that compares the candidate against the reference. Useful for generation tasks where there are many valid outputs but a clear quality bar.

Most production harnesses end up with a mix. Structural checks for the parts that are easy to verify; rubrics for the parts that aren't. Trying to force every case into "exact match" is how teams end up with brittle evals that fail on prompt changes that improved quality.


LLM-as-Judge — When It's Honest, When It's Lying

Once a team accepts that grading every output by hand doesn't scale, the next step is usually LLM-as-judge: a stronger model reviews each candidate output against the rubric and emits a pass/fail or a numeric score. The pitch is appealing — automated, scalable, cheap relative to human grading. The reality is more nuanced.

LLM-as-judge is honest under three conditions:

  1. The judge is materially stronger than the system-under-test. Asking GPT-4o-mini to grade GPT-4o-mini's output is asking the same model to find its own blind spots. Asking Claude Opus 4.6 to grade Sonnet 4.6's output is closer to honest, because the judge has capability headroom.
  2. The rubric is binary or near-binary. "Did the response answer the user's actual question? yes/no." "Did the response include the required disclaimer? yes/no." LLM judges are reliable on yes/no. They get progressively flakier as the rubric becomes a 1-to-5 scale.
  3. The task isn't one the judge is also bad at. If the judging model can't reliably solve the task itself, it can't reliably grade solutions to it. This rules out LLM-as-judge for cutting-edge reasoning, niche domains the model wasn't trained well on, and any task where the ground truth requires verification the judge doesn't have access to.

LLM-as-judge lies when:

  • The judge sycophantically agrees with the candidate. Modern instruct-tuned models have a measurable bias toward "this is fine" when shown an output. If every grade is a pass, the judge is broken — not the system.
  • The rubric is vague. "Is this response helpful?" produces grade variance higher than the regression you're trying to catch.
  • The judge has a known systematic preference. Many models prefer longer, more verbose outputs even when the user wanted concise. Many prefer outputs that match their own style. These biases creep into the eval and reward whichever prompt change happens to align with them.

The robust pattern: use LLM-as-judge for the easy 80%, human review for the hard 20%, and never use LLM-as-judge as the only signal on a metric the team is going to optimize against. Goodhart's law applies — once a judge is the target, prompt engineering will find ways to please the judge without improving real quality.


The Arithmetic of Grading-an-LLM-with-an-LLM

Eval is not free. The temptation is to believe the cost is small because each individual grading call is cheap, and the temptation is wrong. The cost compounds across two dimensions: the size of the dataset, and the frequency of the runs.

A reference setup: 50 golden cases, each producing one candidate response (~1k input tokens, ~500 output tokens) and one judge call (~1.5k input tokens, ~100 output tokens). Run on every prompt-change pull request, plus a nightly run against the latest production prompt, plus an ad-hoc run whenever someone wants to A/B two prompts.

Per-run cost on Claude Sonnet 4.6 pricing (~$3/M input, ~$15/M output):

Candidate generation: 50 * (1000 * $3/M + 500 * $15/M)
                    = 50 * ($0.003 + $0.0075)
                    = $0.525
Judge calls:          50 * (1500 * $3/M + 100 * $15/M)
                    = 50 * ($0.0045 + $0.0015)
                    = $0.30
Total per run:        ~$0.83

That's nothing. A team running this eval ten times a day spends roughly $250 a month — well below the threshold where finance starts asking questions.

But: this assumes 50 cases. A team that grows the dataset to 500 (perfectly reasonable after six months) sees the per-run cost go to $8.30, and the monthly cost at the same cadence to $2,500. Still defensible, but no longer invisible.

And: this assumes a single judge pass. Best-practice patterns (multi-judge consensus, judge-of-the-judge sanity checks, human review on a fraction of cases) multiply that by 2–4×.

The latency math matters too. A 50-case eval takes 5–15 minutes serially, or 30–60 seconds with reasonable concurrency. A 500-case eval takes 50+ minutes serially. If the eval blocks the merge queue, latency budget becomes a real architectural constraint, not a footnote.

The teams that ship eval harnesses successfully tend to make three explicit choices:

  • A small core set runs on every PR (under 60 seconds wall clock). This catches the regressions worth blocking on.
  • A larger nightly run covers the long tail. Failures here open issues but don't block deploys.
  • An ad-hoc pretty-please-grade-this mode runs on demand against a chosen prompt version. Useful for evaluating a specific change, less useful for routine guarding.

This three-tier shape keeps the harness fast where speed matters and thorough where coverage matters, without trying to be both at the same time.


Prompt Versioning + Diffing — Git Is Enough

A common upsell from prompt-management vendors is "prompt versioning with rich diffs and rollback." The pitch is that prompts are now first-class artifacts and need their own dedicated tooling.

For most teams, this is a solved problem with a tool already in their stack. It is called Git.

The pattern that works:

  1. Prompts live in the repository as text files (or as named string constants, with the strings themselves in a versioned config). Not in a database. Not in a SaaS dashboard. Files.
  2. Every prompt change is a normal pull request. Reviewers see exactly what changed, line by line, with the rest of the change.
  3. The eval harness runs in CI on the pull request and posts the results as a comment.
  4. Rollback is a git revert. Auditing is git log. There is no separate dashboard to keep in sync.

What this gives up: the ability for non-engineers to edit prompts in a UI without going through code review. For some products that's a meaningful loss. For most production LLM systems, the prompt is load-bearing infrastructure and should go through code review.

What this gains: prompt history, blame, branching, and review for free, integrated with everything else the team already does. No vendor lock-in, no extra dashboard, no separate access-control model.

When a team genuinely needs more — A/B testing across user cohorts, gradual rollout, runtime configuration — those are usually solved by adding a thin feature-flag layer, not by adopting a prompt-management product. Named prompt configs, gated by a flag, deployed through the normal release pipeline. That's enough.


Three Scenarios

Scenario A — Small team, single LLM feature, low traffic

A startup with one LLM-powered feature serving a few hundred queries a day. Two engineers. No dedicated MLOps person.

Right answer: 20–30 golden cases, structural checks (json.loads() succeeds, required keys present), one LLM-as-judge rubric for the qualitative parts, a CI step that runs the harness on PRs touching the prompt. Total build effort: a long afternoon. Total monthly cost: under $50.

What's wrong here is not having the harness at all. The teams that skip this stage tend to discover regressions by user report and fix them by guessing.

Scenario B — Mid-stage product, several LLM flows, real traffic

A team with three to five distinct LLM-powered surfaces, thousands of queries per day, and someone whose job title contains the word "platform." Multiple engineers ship prompt changes weekly.

Right answer: A shared harness library, golden datasets per flow (50–200 cases each), a tiered run model (PR runs fast, nightly runs thorough), LLM-as-judge with multi-judge consensus on metrics that drive decisions, a small dashboard tracking pass-rate trend over time. Treat the harness as code with the same engineering rigor as the product.

The failure mode at this stage is harness drift: the dataset stops reflecting production reality. Mitigations: a quarterly audit that pulls 20 fresh cases from production traces and adds them to the dataset; a metric on harness pass rate vs. production complaint rate; an explicit owner.

Scenario C — Multi-tenant platform, many LLM features, regulated industry

A platform team supporting many product teams, each shipping LLM features. Regulatory or compliance pressure. The cost of a silent regression is high.

Right answer: All of the above, plus dedicated golden datasets per tenant or per use case, formal acceptance criteria for new prompt versions, human-in-the-loop review for a sampled fraction of every run, and the eval results stored as part of the audit trail. At this scale, it's worth investing in a small purpose-built eval framework rather than gluing pieces together; the maintenance burden of glue starts to dominate.

The framework adoption decision flips here: at small scale, off-the-shelf eval frameworks add overhead with little payoff; at platform scale, they save the equivalent of a small team's worth of engineering hours per quarter.


The Minimum Viable Harness

What follows is the actual structure of a working harness, distilled. Roughly fifty lines of Python — readable, runnable, and the right shape to grow from.

# evals/harness.py
import json
from dataclasses import dataclass
from pathlib import Path

import anthropic

client = anthropic.Anthropic()

@dataclass
class Case:
    id: str
    input: str
    rubric: str        # what makes this output a pass

def load_cases() -> list[Case]:
    raw = json.loads(Path("evals/golden.json").read_text())
    return [Case(**c) for c in raw]

def candidate(prompt: str, user_input: str) -> str:
    """Run the system-under-test against one case."""
    resp = client.messages.create(
        model="claude-sonnet-4-6",
        max_tokens=512,
        system=prompt,
        messages=[{"role": "user", "content": user_input}],
    )
    return resp.content[0].text

def judge(rubric: str, candidate_output: str) -> tuple[bool, str]:
    """Ask a stronger model whether the candidate satisfies the rubric."""
    judge_prompt = (
        "You are evaluating an AI assistant's output against a rubric. "
        "Reply with exactly one of: PASS or FAIL on the first line, "
        "followed by one sentence of justification on the second line."
    )
    user = f"RUBRIC:\n{rubric}\n\nCANDIDATE OUTPUT:\n{candidate_output}"
    resp = client.messages.create(
        model="claude-opus-4-7",
        max_tokens=200,
        system=judge_prompt,
        messages=[{"role": "user", "content": user}],
    )
    text = resp.content[0].text.strip()
    verdict = text.splitlines()[0].upper()
    return verdict == "PASS", text

def run(prompt_path: str) -> dict:
    prompt = Path(prompt_path).read_text()
    cases = load_cases()
    results = []
    for c in cases:
        out = candidate(prompt, c.input)
        ok, reason = judge(c.rubric, out)
        results.append({"id": c.id, "pass": ok, "reason": reason})
    n_pass = sum(r["pass"] for r in results)
    return {
        "prompt": prompt_path,
        "n_cases": len(cases),
        "n_pass": n_pass,
        "pass_rate": n_pass / len(cases),
        "results": results,
    }

if __name__ == "__main__":
    import sys
    report = run(sys.argv[1])
    print(json.dumps(report, indent=2))

That's the whole shape. The pieces missing from this version, deliberately:

  • Concurrency. A real harness runs cases in parallel. asyncio.gather over the 50 cases brings wall time from minutes to seconds.
  • Retries and timeouts. Production-grade — important, but boilerplate.
  • Structured logging and tracing. Critical at scale, but the harness works without it.
  • A diff against the previous run. The single most useful feature once there are at least two runs to compare. CI integration usually handles this.

The shape is what matters. Once those four parts are wired up, the team has gone from "we vibe-check prompts" to "every prompt change has a comparable score." That step is most of the win. Everything else is polish.


The Rule Before You Ship a Prompt Change

Before merging any non-trivial prompt change, run through this:

  1. Did the eval harness run on this change? If not, why not. There should be a real answer, not just "it's a small change." Most regressions come from changes that looked small.
  2. What did the pass rate change? If it dropped, what cases broke and is the failure intentional? If it went up, by how much, and on what — generic improvements, or a single rubric the prompt now happens to satisfy?
  3. Are there cases the harness doesn't cover that this change might affect? If yes, add them to the dataset before merging. Datasets that grow only after incidents are how silent regressions accumulate.
  4. What's the rollback plan if this change ships and quality regresses in production? "Revert the commit and redeploy" is a fine answer. "Hope nobody notices" is not.

A team that runs through these four questions on every prompt change ships measurably fewer silent regressions. Not because the harness is magic — because the discipline of asking the questions surfaces the cases the team would otherwise have skipped.


Closing

Every LLM feature is, eventually, a prompt that has to evolve. The teams that treat the prompt as code — version-controlled, reviewed, graded against representative cases, scored before merge — keep their feature working as the model, the data, and the requirements drift around it. The teams that don't, find out by user report.

An eval harness is not exotic. It is a small dataset, a runner, a judge, and a report, glued together in roughly fifty lines of code. The payoff is not a spike of insight. It's the absence of a recurring class of bug — the silent prompt regression — and the ability to ship prompt changes with the same confidence as a normal code change.

The framing that lands is the one borrowed from every other piece of production engineering: if it's not tested, it's not shipped. Vibes are not tests. Build the harness.

Stop Vibe-Checking Your Prompts - Building an Eval Harness That Catches Regressions Before Your Users Do | Vahid Aghajani