Why Your LLM Keeps Returning Garbage JSON (And How to Stop It)
Every LLM-powered feature breaks the same way in production: the model returns almost-JSON. Markdown fences, trailing commas, a chatty preamble, a missing closing brace. Here's the 3-layer fix that ships — native structured outputs, Pydantic validation, and json_repair + retry loops.
You wire up an LLM call. The demo is magical. You ship it.
The next morning Sentry is on fire. json.JSONDecodeError: Expecting value: line 1 column 1 (char 0). You open the failed payload and the model has politely returned:
Sure! Here's the JSON you asked for:
```json
{
"name": "Acme Corp",
"founded": 1998,
}
```
Let me know if you need anything else!
Three things broke at once: a chatty preamble, a markdown code fence, and a trailing comma. Every team building on LLMs hits this within the first week of going to production. The fix isn't one trick — it's three layers, each catching what the previous one misses.
This post is the layered playbook: native structured-output APIs first, typed validation second, repair-and-retry third. It's the same pattern shipping in our own production code today.
Why LLMs Fail at JSON in the First Place
A language model doesn't output JSON. It outputs the most likely next token, repeatedly. JSON is just a particular sequence of tokens it has seen a lot during training. So whenever the prompt is even slightly ambiguous about format — or the model is fine-tuned to be helpful and conversational — those token probabilities drift toward natural language.
Common failure modes, ranked by frequency:
- Wrapped in markdown fences — ```json ... ``` because the training data is full of code blocks.
- Chatty preamble or trailing chatter — "Here's your JSON:", "Hope this helps!".
- Trailing commas — JSON forbids them, JavaScript and Python don't.
- Single quotes — looks like JSON, isn't.
- Unescaped quotes inside strings — "He said "hi"".
- Truncation mid-object — token limit hit, last brace missing.
- Hallucinated fields — extra keys you didn't ask for; missing required ones.
- Wrong types — "founded": "1998" (string) when you asked for an integer.
You can't prompt your way out of all of these. You need the model to be constrained, the output to be typed, and a fallback for when both still fail.
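Here's a minimal repro of what strict parsing does with such a reply (the payload string is invented, but it combines two of the modes above: a chatty preamble and a trailing comma):

```python
import json

# An invented "almost-JSON" reply: chatty preamble plus a trailing comma.
raw = 'Sure! Here is the JSON: {"name": "Acme Corp", "founded": 1998,}'

try:
    json.loads(raw)
except json.JSONDecodeError as e:
    # Strict parsing dies on the very first character of the preamble.
    print(f"parse failed: {e.msg} (char {e.pos})")
```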
Layer 1: Use the Native Structured-Output API
Every major provider now ships a way to constrain decoding to a schema. Use it. This single change kills 80–90% of the failure modes above.
OpenAI — response_format={"type": "json_schema", ...}
The model is forced to produce tokens that satisfy the schema. No prose, no fences.
```python
from openai import OpenAI
from pydantic import BaseModel

class Company(BaseModel):
    name: str
    founded: int
    industry: str

client = OpenAI()

resp = client.beta.chat.completions.parse(
    model="gpt-4o-2024-08-06",
    messages=[{"role": "user", "content": "Extract from: 'Acme Corp, founded 1998, makes industrial widgets.'"}],
    response_format=Company,
)

company: Company = resp.choices[0].message.parsed
```
The .parsed attribute is already a typed Pydantic object. No json.loads. No regex. No prayer.
Set temperature=0 whenever structured outputs are on. Creativity inside a schema doesn't make the output better — it makes the model more likely to invent fields, drift toward the edges of an enum, or produce values that pass the schema but fail your validators. Save the temperature for prose generation.
Anthropic — Tool use as a schema enforcer
Claude doesn't have a response_format field, but tool definitions act the same way: declare a tool with an input_schema, force the model to call it, and the input you get back is schema-conformant JSON.
```python
import anthropic

client = anthropic.Anthropic()

resp = client.messages.create(
    model="claude-sonnet-4-6",
    max_tokens=1024,
    tools=[{
        "name": "extract_company",
        "description": "Extract structured company data.",
        "input_schema": Company.model_json_schema(),
    }],
    tool_choice={"type": "tool", "name": "extract_company"},
    messages=[{"role": "user", "content": "Acme Corp, founded 1998, makes industrial widgets."}],
)

company = Company.model_validate(resp.content[0].input)
```
tool_choice forces the model to call the tool, which forces schema compliance.
Gemini — response_mime_type + response_schema
Gemini is especially strong at enforcing enums during decoding. If a field has a fixed set of valid values, declare it as a Literal — Gemini will physically refuse to emit a token outside that set, killing a whole class of Layer-2 validation failures before they happen.
```python
from typing import Literal

from google import genai
from google.genai import types
from pydantic import BaseModel

class CompanyStatus(BaseModel):
    name: str
    founded: int
    industry: Literal["software", "manufacturing", "finance", "other"]
    status: Literal["active", "acquired", "defunct"]

client = genai.Client()

resp = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Acme Corp, founded 1998, makes industrial widgets. Still trading.",
    config=types.GenerateContentConfig(
        response_mime_type="application/json",
        response_schema=CompanyStatus,
        temperature=0,
    ),
)

company = CompanyStatus.model_validate_json(resp.text)
```
Rule of thumb: if the API has a structured-output mode, use it. Don't hand-roll prompts that say "return only valid JSON, no other text" — that worked in 2023 and it still kind of works in 2026, but it's strictly worse than constrained decoding.
Layer 2: Validate With Pydantic (Even If Layer 1 Worked)
Constrained decoding gives you syntactic JSON. It does not give you semantic correctness. The model can still:
- Return founded: 1 when you wanted a 4-digit year.
- Return an empty name.
- Return a plausible-but-wrong industry.
- Return all-nulls because the source text didn't actually contain the fields.
Pydantic catches these at the boundary, before bad data enters your domain logic:
```python
from typing import Literal

from pydantic import BaseModel, Field, field_validator

class Company(BaseModel):
    name: str = Field(min_length=1, max_length=200)
    founded: int = Field(ge=1800, le=2030)
    industry: Literal["software", "manufacturing", "finance", "other"]

    @field_validator("name")
    @classmethod
    def name_not_placeholder(cls, v: str) -> str:
        if v.strip().lower() in {"unknown", "n/a", "none", ""}:
            raise ValueError("name looks like a placeholder")
        return v
```
Validation failures here are useful signal, not just errors. If founded=1 keeps tripping the validator, your prompt is ambiguous — fix the prompt. If industry="other" appears too often, your enum is too narrow — fix the schema.
The pattern: every LLM call returns into a Pydantic model, and the rest of your codebase only ever sees validated objects. Treat the LLM like a third-party API you don't trust.
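To make that concrete, here's the boundary check in isolation (a trimmed-down Company without the custom validator): the payload is perfectly parseable JSON, and it still gets stopped because the values are wrong.

```python
from typing import Literal

from pydantic import BaseModel, Field, ValidationError

class Company(BaseModel):
    name: str = Field(min_length=1, max_length=200)
    founded: int = Field(ge=1800, le=2030)
    industry: Literal["software", "manufacturing", "finance", "other"]

# Syntactically perfect JSON that should never reach your domain logic:
bad = '{"name": "Acme Corp", "founded": 1, "industry": "manufacturing"}'

try:
    Company.model_validate_json(bad)
except ValidationError as e:
    # Exactly one failure, on the `founded` field -- the signal to log.
    print(e.error_count(), [err["loc"] for err in e.errors()])
```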
Layer 3: Repair and Retry When Both Layers Above Fail
For ~95% of calls, layers 1 and 2 are enough. The remaining ~5% — long inputs, weird edge cases, model degradation, rate-limit retries that hit a different model version — still fail. You need a fallback.
json_repair for almost-JSON
json-repair is a small library that fixes the common malformations: trailing commas, single quotes, missing closing braces, markdown fences, prose-around-JSON.
```python
import json

from json_repair import repair_json

raw = response.text  # the model's raw string

try:
    data = json.loads(raw)
except json.JSONDecodeError:
    data = json.loads(repair_json(raw))
```
It's not magic — it's a forgiving parser. It will succeed on inputs that strict JSON refuses, and it has saved more production calls than any single prompt change.
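If you can't take on the dependency, the core idea fits in a few lines. This is a toy sketch, not a substitute for json-repair: it only handles a prefixed/fenced object and trailing commas, and the regex would mangle commas inside string values.

```python
import json
import re

def naive_repair(raw: str) -> str:
    """Toy repair pass: keep the outermost {...} span (dropping preamble,
    fences, and trailing chatter), then strip trailing commas."""
    start, end = raw.find("{"), raw.rfind("}")
    if start == -1 or end == -1:
        raise ValueError("no JSON object found")
    candidate = raw[start : end + 1]
    return re.sub(r",\s*([}\]])", r"\1", candidate)

raw = 'Sure! ```json\n{"name": "Acme Corp", "founded": 1998,}\n``` Hope this helps!'
data = json.loads(naive_repair(raw))
print(data)  # {'name': 'Acme Corp', 'founded': 1998}
```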
Retry with the validation error fed back to the model
If repair_json fails and Pydantic validation fails, retry with the error message in the next prompt. The model is genuinely good at fixing its own mistakes when you tell it what broke:
```python
from pydantic import ValidationError

def call_with_repair(messages, schema_cls, max_retries=2):
    for attempt in range(max_retries + 1):
        resp = call_llm(messages)  # native structured-output call
        try:
            return schema_cls.model_validate_json(resp.text)
        except ValidationError as e:
            if attempt == max_retries:
                raise
            messages.append({"role": "assistant", "content": resp.text})
            messages.append({
                "role": "user",
                "content": f"Your previous response failed validation: {e}\nFix it and return only valid JSON matching the schema.",
            })
```
Two retries is the sweet spot. One isn't enough for sticky failures; three burns tokens for almost no extra success rate.
Retries multiply token cost — use prompt caching to flatten the curve. A 50K-token prompt that retries twice is 150K billed tokens at full price unless you cache. OpenAI, Anthropic, and Gemini all ship prompt caching in 2026; the second and third attempts should hit the cached prefix at a fraction of the cost (typically 10–25% of the uncached rate). Cache the system prompt + the source document, vary only the validation-error feedback message.
Make failure structured, too
When all retries are exhausted, don't raise Exception("LLM failed"). Raise a typed exception your caller can branch on:
```python
class InsufficientDataError(Exception):
    """The source material genuinely didn't contain the requested fields."""

class SchemaViolation(Exception):
    """The model couldn't conform to the schema after retries."""
```
Different failures deserve different handling. A SchemaViolation is a model/prompt problem — log it, alert. An InsufficientDataError is a data problem — surface it to the user as "we couldn't extract X from this document", not as a 500.
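At the call site the branch might look like this (the stub pipeline and handler names are illustrative, not part of the pattern above):

```python
import logging

logger = logging.getLogger(__name__)

class InsufficientDataError(Exception): ...
class SchemaViolation(Exception): ...

def extract_company(text: str) -> dict:
    # Stand-in for the real Layer 1-3 pipeline, just to exercise the branches.
    if "Acme" not in text:
        raise InsufficientDataError("no company mentioned in source text")
    return {"name": "Acme Corp"}

def handle_extraction(text: str) -> dict:
    try:
        return extract_company(text)
    except InsufficientDataError:
        # Data problem: surface it to the user, don't page anyone.
        return {"error": "We couldn't extract company details from this document."}
    except SchemaViolation as exc:
        # Model/prompt problem: log, alert, and let it become a 500.
        logger.error("schema violation after retries: %s", exc)
        raise
```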
Putting It All Together
Here's the full pattern, condensed:
```python
def extract_company(text: str) -> Company:
    messages = [
        {"role": "system", "content": "Extract structured company data from the text."},
        {"role": "user", "content": text},
    ]
    return call_with_repair(messages, schema_cls=Company, max_retries=2)
```
Three layers, one entry point. The caller never sees a JSONDecodeError. They get a Company or a typed exception they can handle.
Decision Matrix
| You're calling… | Use |
|---|---|
| OpenAI GPT-4o or newer | response_format=PydanticModel (.parse() API) |
| Anthropic Claude | Tool with forced tool_choice + model_json_schema() |
| Gemini 2.5+ | response_mime_type="application/json" + response_schema=PydanticModel |
| OpenRouter wrapper | Whatever the underlying model supports — check, don't assume |
| Local Llama/Mistral via vLLM | Outlines or LM Format Enforcer for grammar-constrained decoding |
| Anything older or weirder | Plain prompt + json_repair + Pydantic + retry loop |
And one more axis: model size
| Model class | Strategy |
|---|---|
| Frontier (GPT-4o, Claude Opus 4.x, Gemini 2.5 Pro) | Layers 1 + 2 are usually enough. Layer 3 catches the long tail. |
| Small / edge (Gemini Flash, Llama 3.x 8B, Phi-4, Mistral 7B) | Layer 3 is mandatory. Small models trip on nested schemas, Optional fields, and long enums far more often. Budget for 2–5% retry rate even with structured outputs on. |
Gotchas Nobody Tells You
- Schemas with Optional[T] get filled with None aggressively. The model treats "I don't know" as a valid answer when nullability is allowed. If you need extraction to be honest about missing data, use a typed exception path instead.
- Enums with too many values regress to "other". Keep Literal[...] lists short. If you need 50 categories, use a two-step pipeline: free-text → embedding → nearest enum.
- additionalProperties: false matters. Without it, the model invents fields. Pydantic only emits it when you set model_config = ConfigDict(extra="forbid") — the OpenAI .parse() helper adds it for you when converting the schema to strict mode — so double-check if you write the JSON Schema by hand.
- Streaming + structured outputs is half-broken everywhere. You can stream the JSON, but you can't typed-parse it until the stream finishes. Don't promise users a typewriter effect on extracted data.
- JSON mode is not free. Constrained decoding adds 10–30% latency on most providers. Worth it for correctness, but budget for it.
- Retries multiply cost when prompts are long. A 50K-token prompt that retries twice is 150K billed tokens. Cache the system prompt aggressively.
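The two-step enum pipeline mentioned in the gotchas can be sketched as follows; string similarity stands in for embedding similarity so the example stays dependency-free, and the category list is invented:

```python
from difflib import SequenceMatcher

CATEGORIES = ["software", "manufacturing", "finance", "retail", "healthcare"]

def nearest_category(free_text: str, categories: list[str] = CATEGORIES) -> str:
    """Step 2 of the two-step pipeline: snap the model's free-text label
    onto a fixed category. Real systems rank by embedding similarity;
    SequenceMatcher is just a cheap stand-in for the shape of the idea."""
    return max(
        categories,
        key=lambda c: SequenceMatcher(None, free_text.lower(), c).ratio(),
    )

nearest_category("industrial manufacturing of widgets")  # -> "manufacturing"
```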
Closing
The "LLM-as-software-component" problem isn't solved by a smarter model. It's solved by treating the model like every other unreliable upstream service: constrain what it can return, validate what it does return, repair what's almost right, retry with feedback, fail loudly with typed errors when all else breaks.
Three layers. Each one catches what the previous one missed. Ship that and your Sentry inbox stops lighting up at 6 a.m.
Next post in the series: how to measure whether your prompt actually works — golden datasets, LLM-as-judge, and the eval harness most teams skip.