Build Your Own Model Registry in a Weekend — FastAPI + Postgres, No MLflow
Most ML teams reach for MLflow before they need it and pay the operational tax for years. This is the custom model-registry pattern from a real labeling platform: version, compare, and roll back models with a 200-line FastAPI service. Plus when the DIY version is the right answer, and the three signals that say it isn't anymore.
A model registry has a reputation problem. The phrase summons MLflow dashboards, S3 artifact stores, server processes that need their own backup strategy, and a setup wiki page that nobody on the team can fully reproduce six months later.
For a small team shipping a single ML product, that picture is wildly out of scale with the actual job. The actual job is small: keep track of which models were trained, what they were trained on, how well they did, and which weights file goes with which row in the database. That is a Postgres table and a few endpoints. It is genuinely a weekend.
This post walks through the pattern from a real labeling platform that trains YOLOv8 detection models — the table shape, the auto-increment version trick, the dynamic-metric comparison endpoint, and the three signals that say the custom version has finally outgrown its weekend-project clothes.

What a Model Registry Actually Does
The MLOps maturity model talks about registries like they're one thing. They aren't. There are four jobs hiding inside the term, and a small team only needs the first three:
- Identity. Every trained model gets a stable ID. Given any model in production, you can recover what code, data, and config produced it.
- Lineage. Each model points back to the dataset it was trained on, the parent model it was fine-tuned from, and the run that produced it.
- Comparison. Given two or more models, you can pull their metrics side by side and answer "did the new one improve over the old one."
- Serving. A registry-aware inference service can switch which version it serves by changing a row, not by redeploying.
Job #4 is where MLflow earns its keep — model serving, A/B routing, blue-green deploys, audit trails of what served which traffic when. If you're not yet doing those things, you're paying for infrastructure you don't use.
Jobs #1–#3 fit in one Postgres table.
The Minimum Schema
Here is the core table from the labeling platform, lightly trimmed. SQLAlchemy 2.0, Postgres, with JSONB for the parts that change shape:
import uuid
from datetime import datetime

from sqlalchemy import Column, DateTime, Float, ForeignKey, Integer, String
from sqlalchemy.dialects.postgresql import JSONB, UUID

class TrainingJob(Base):
    __tablename__ = "training_jobs"

    id = Column(UUID(as_uuid=True), primary_key=True, default=uuid.uuid4)
    project_id = Column(UUID(as_uuid=True), ForeignKey("projects.id"), nullable=False)
    dataset_id = Column(UUID(as_uuid=True), ForeignKey("datasets.id"), nullable=False)
    model_type = Column(String, nullable=False)   # yolov8n, yolov8s, rtdetr-l ...
    version = Column(Integer, nullable=True)      # auto-incremented per project
    config = Column(JSONB, nullable=True)         # epochs, batch, imgsz, lr ...
    metrics = Column(JSONB, nullable=True)        # mAP50, mAP50-95, precision ...
    status = Column(String, default="pending")    # pending, training, completed, failed
    model_path = Column(String, nullable=True)    # filesystem or S3 key
    model_size = Column(Float, nullable=True)     # MB
    started_at = Column(DateTime, nullable=True)
    completed_at = Column(DateTime, nullable=True)
    created_at = Column(DateTime, default=datetime.utcnow)
Three things are doing real work in this schema:
- version is per-project, not global. A user training 5 YOLO models in project A and 3 in project B should see v1..v5 and v1..v3, not a global counter that confuses them.
- config and metrics are JSONB, not separate columns. Different model types have different hyperparameters and different metrics. Rigid columns become a migration treadmill.
- status is a string, not an enum. Enums look principled and become painful the first time a new state has to be added in production. Strings + a comment cost nothing and bend without breaking.
The model_path is intentionally just a string. It can point to a local disk in dev, an S3 key in production, a Modal volume, or a Hugging Face repo. The registry doesn't care where the bytes live; it cares that the row knows where they live.
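As one sketch of what that looks like at read time, a resolver along these lines would work; the s3:// prefix convention, the boto3 call, and the cache directory are assumptions for illustration, not the platform's actual code:

import os
from pathlib import Path

def resolve_model_file(model_path: str, cache_dir: str = "/tmp/models") -> Path:
    """Hypothetical helper: fetch weights wherever model_path points.

    Assumes a simple convention: "s3://bucket/key" means object storage,
    anything else is treated as a local filesystem path.
    """
    if model_path.startswith("s3://"):
        import boto3  # assumed dependency in the S3 case
        bucket, _, key = model_path[len("s3://"):].partition("/")
        local = Path(cache_dir) / os.path.basename(key)
        local.parent.mkdir(parents=True, exist_ok=True)
        boto3.client("s3").download_file(bucket, key, str(local))
        return local
    return Path(model_path)  # local disk in dev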
The Auto-Increment Version Trick
Per-project version numbers are the single most user-visible feature of a registry. Calling models v1, v2, v3 inside a project is dramatically friendlier than UUIDs, and in SQLAlchemy it costs one extra query:
from fastapi import Depends
from sqlalchemy import func
from sqlalchemy.orm import Session

@router.post("/{project_id}/train", status_code=201)
def start_training(
    project_id: str,
    data: TrainingJobCreate,
    db: Session = Depends(get_db),              # the app's own session dependency
    current_user=Depends(get_current_user),     # the app's own auth dependency
):
    project = _get_owned_project(db, project_id, current_user)
    max_version = db.query(func.max(TrainingJob.version)).filter(
        TrainingJob.project_id == project.id
    ).scalar()
    next_version = (max_version or 0) + 1
    job = TrainingJob(
        project_id=project.id,
        dataset_id=data.dataset_id,
        model_type=data.model_type,
        version=next_version,
        config={**DEFAULT_CONFIGS[data.model_type], **(data.config or {})},
        status="pending",
    )
    db.add(job)
    db.commit()
    db.refresh(job)
    run_training_job.delay(str(job.id))  # Celery
    return job
Two things to call out:
This is racy under high concurrency. Two start_training calls landing inside the same millisecond against the same project can both read max_version=4 and both insert v5. For a team of one or two engineers and a single user per project, this never happens in practice. If it does, the fix is a unique constraint on (project_id, version) plus a retry on IntegrityError. Don't add the constraint until you've seen the bug — premature locking is its own form of pain.
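If the bug does show up, a minimal sketch of that fix, assuming the allocation stays inside the request handler, could look like this; the constraint name and retry count are arbitrary:

from fastapi import HTTPException
from sqlalchemy import func
from sqlalchemy.exc import IntegrityError

# 1. Add the constraint to TrainingJob (plus an Alembic migration to create it):
#    __table_args__ = (
#        UniqueConstraint("project_id", "version", name="uq_training_jobs_project_version"),
#    )

# 2. Wrap the read-increment-insert in a small retry loop inside start_training:
for _attempt in range(3):
    max_version = db.query(func.max(TrainingJob.version)).filter(
        TrainingJob.project_id == project.id
    ).scalar()
    job = TrainingJob(
        project_id=project.id,
        dataset_id=data.dataset_id,
        model_type=data.model_type,
        version=(max_version or 0) + 1,
        config={**DEFAULT_CONFIGS[data.model_type], **(data.config or {})},
        status="pending",
    )
    db.add(job)
    try:
        db.commit()
        break
    except IntegrityError:  # a concurrent request claimed the same version
        db.rollback()
else:
    raise HTTPException(status_code=409, detail="Could not allocate a version number")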
Defaults merge with user config. The pattern {**DEFAULT_CONFIGS[model], **user_config} is a tiny piece of ergonomics that pays off forever. Users specify what they care about; the registry fills in the rest; the persisted config is always complete. No "what was the default batch size when this was trained?" archaeology six months later.
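For concreteness, the defaults table the merge reads from can be as small as this; the values here are invented, not the platform's real defaults:

# Hypothetical defaults; the real numbers live alongside the training code.
DEFAULT_CONFIGS = {
    "yolov8n": {"epochs": 100, "batch": 16, "imgsz": 640, "lr0": 0.01},
    "yolov8s": {"epochs": 100, "batch": 16, "imgsz": 640, "lr0": 0.01},
    "rtdetr-l": {"epochs": 72, "batch": 8, "imgsz": 640, "lr0": 0.0001},
}

# The user sends only what they care about; the persisted config is complete.
user_config = {"epochs": 50}
persisted = {**DEFAULT_CONFIGS["yolov8n"], **user_config}
# -> {"epochs": 50, "batch": 16, "imgsz": 640, "lr0": 0.01}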
The Compare Endpoint — Dynamic Metric Discovery
This is the part where most off-the-shelf registries either over-engineer or under-deliver. The job is: given a list of model versions, return their metrics in a shape a frontend can render as a comparison table. The challenge is that different model types report different metrics — YOLO has mAP50 and mAP50-95, a classifier has top1_acc and top5_acc, an LLM fine-tune has eval_loss and bleu.
Here is the entire endpoint:
@router.get("/{project_id}/models/compare")
def compare_models(project_id: str, job_ids: str = "", db: Session = ...):
project = _get_owned_project(db, project_id, current_user)
query = db.query(TrainingJob).filter(
TrainingJob.project_id == project.id,
TrainingJob.status == "completed",
)
if job_ids:
ids = [jid.strip() for jid in job_ids.split(",")]
query = query.filter(TrainingJob.id.in_(ids))
jobs = query.order_by(TrainingJob.version).all()
return {
"models": [
{
"id": str(job.id),
"version": job.version,
"model_type": job.model_type,
"config": job.config,
"metrics": job.metrics,
"model_size": job.model_size,
"started_at": job.started_at.isoformat() if job.started_at else None,
"completed_at": job.completed_at.isoformat() if job.completed_at else None,
}
for job in jobs
],
"metric_keys": _collect_metric_keys(jobs),
}
def _collect_metric_keys(jobs):
keys = set()
for job in jobs:
if job.metrics:
keys.update(k for k, v in job.metrics.items() if isinstance(v, (int, float)))
return sorted(keys)
The trick is _collect_metric_keys: walk the JSONB blobs, collect every numeric key that appears anywhere, and return the sorted union. The frontend renders one table column per discovered key, leaving cells empty where a particular model didn't report that metric. No schema migrations when a new metric shows up. No metric registry table to keep in sync. The data tells the UI what to render.
This is the moment where JSONB earns its rent. A relational schema with one column per metric would force a migration the first time someone trained a different model class. The JSONB shape absorbs new metrics for free and the comparison endpoint introspects them on read.
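Concretely, a response for two completed runs that reported slightly different metric sets might look roughly like this (ids abbreviated, values invented); the frontend renders four metric columns and leaves empty the cells a model didn't report:

{
    "models": [
        {"id": "9b2f…", "version": 1, "model_type": "yolov8n",
         "metrics": {"mAP50": 0.71, "mAP50-95": 0.48, "precision": 0.77}},
        {"id": "c41a…", "version": 2, "model_type": "yolov8s",
         "metrics": {"mAP50": 0.76, "mAP50-95": 0.53, "recall": 0.70}},
    ],
    "metric_keys": ["mAP50", "mAP50-95", "precision", "recall"],
}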
What Else You Get for Free
Three more endpoints fall out of the same table without much extra code, and together they cover the rest of what most teams actually need:
- GET /jobs — list every model in a project, ordered by created_at. The "registry view" in the UI.
- GET /jobs/{id}/download — return the weights file. Two lines on top of FastAPI's FileResponse. The filename can be derived from the row: {project.name}_v{job.version}_{job.model_type}.pt. Self-describing artifact name with zero extra storage. A sketch follows after this list.
- GET /models/curves — read each completed job's results.csv from disk and return parsed epoch-level metrics for an overlay chart. Training curves comparison without ever launching TensorBoard.
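Here is a minimal sketch of the download endpoint, assuming the same get_db / get_current_user dependencies as the earlier snippets and weights stored on local disk:

from fastapi import Depends, HTTPException
from fastapi.responses import FileResponse

@router.get("/{project_id}/jobs/{job_id}/download")
def download_model(
    project_id: str,
    job_id: str,
    db: Session = Depends(get_db),
    current_user=Depends(get_current_user),
):
    project = _get_owned_project(db, project_id, current_user)
    job = db.query(TrainingJob).filter(
        TrainingJob.id == job_id,
        TrainingJob.project_id == project.id,
    ).first()
    if not job or not job.model_path:
        raise HTTPException(status_code=404, detail="No weights for this job")
    # Self-describing filename derived entirely from the row.
    filename = f"{project.name}_v{job.version}_{job.model_type}.pt"
    return FileResponse(job.model_path, filename=filename)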
The point isn't that any of these are clever individually. It's that the cumulative weight of "thirty lines of FastAPI per endpoint, sharing one Postgres table" stays flat as features are added. A registry-as-a-service architecture would have already paid the cost of standing up a separate process before any of this code got written.
When the Custom Version Is Right
The pattern works because the team's actual needs are narrow:
- One ML product, possibly with multiple sub-projects, but a single team owning the registry.
- Single-tenant access patterns. Each project has an owner. No cross-team sharing, no marketplace, no multi-org governance.
- Disk or simple object storage for weights. Nobody needs registry-mediated downloads with row-level audit logs.
- Simple promotion flow. "Deploy this version" is a row update, not a workflow.
- Comparison happens by humans, in a UI, eyeballing numbers — not by automated CI gates blocking deploys on metric regressions.
If those bullets describe the situation, the custom registry is not a tradeoff. It's the right architecture and MLflow is the wrong one. The "industry standard" framing pushes teams toward tools designed for organizations one or two orders of magnitude larger than they actually are.
When It Isn't Anymore — Three Signals
The custom registry is the right answer until it isn't. The graduation signals are usually:
Signal 1 — multi-tenant becomes real. The first time a customer needs to bring their own models, sees other customers' models, or has compliance requirements about who can read which artifact, the home-grown access-control layer that started as WHERE owner_id = ? starts collecting bolt-ons. At that point, an off-the-shelf registry's pre-built RBAC saves real time. Until then, it's solving a problem you don't have.
Signal 2 — model serving needs blue-green or canary. A row update is a fine deploy mechanism when "deploy" means "the next inference call uses this version." It stops being fine when 1% of traffic should hit the new version, when the routing decision needs to be observable per-request, or when rollback needs to happen in seconds based on production metrics. That's a feature flag plus a serving layer, not a database row.
Signal 3 — automated quality gates in CI. The custom comparison endpoint serves a UI. The day you want CI to refuse to merge a model with a mAP50 regression of more than 2 points, you need the registry to expose machine-friendly thresholds and the deployment pipeline to consume them. The 200-line version doesn't have that surface, and bolting it on starts to feel like reinventing what MLflow already ships.
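For a sense of what that missing surface would have to look like, a gate bolted onto the existing compare endpoint might be a short script that fails the pipeline on a regression; the URL, auth handling, and 2-point threshold below are purely illustrative:

import sys
import requests  # assumed to be available in the CI image

# Hypothetical gate: compare the candidate model against the current baseline
# and fail the pipeline on a mAP50 regression of more than 2 points.
COMPARE_URL = "https://app.example.com/api/projects/{project_id}/models/compare"

def gate(project_id: str, baseline_id: str, candidate_id: str, token: str) -> int:
    resp = requests.get(
        COMPARE_URL.format(project_id=project_id),
        params={"job_ids": f"{baseline_id},{candidate_id}"},
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    resp.raise_for_status()
    models = {m["id"]: m for m in resp.json()["models"]}
    baseline = models[baseline_id]["metrics"].get("mAP50", 0.0)
    candidate = models[candidate_id]["metrics"].get("mAP50", 0.0)
    if candidate < baseline - 0.02:  # metrics stored as 0-1 fractions; 2 points = 0.02
        print(f"mAP50 regressed: {baseline:.3f} -> {candidate:.3f}")
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(gate(*sys.argv[1:5]))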
If none of those three signals are present, the operational cost of MLflow — yet another service to host, back up, upgrade, and authenticate against — is dead weight. The registry-in-Postgres pattern stays well within its competence and a git log is the audit trail.
The Honest Caveats
Two things this pattern does not handle gracefully, and pretending otherwise would be dishonest:
- Cross-team model discovery. "Has anyone else in the company already trained a YOLO for warehouse pallets?" requires either a global model search index or a culture where models are always shared via a known channel. Neither is the registry's job, but neither is solved by it.
- Reproducibility audits. A row in Postgres tells you what code and config produced a model only if the code and data are themselves versioned and immutable. Pinning the dataset row by ID is necessary but not sufficient — if dataset rows can be edited in place, the lineage is a lie. The registry trusts the rest of the system to be reproducible. If the rest of the system isn't, no registry can rescue it.
For most products in the small-team stage, both of these caveats describe problems that aren't real yet. They're worth flagging so the trapdoor is visible before someone falls through it.
What to Build First
A pragmatic build order, end-to-end, for a team starting from scratch:
- The table. TrainingJob (or whatever your training unit is) with id, version, config (JSONB), metrics (JSONB), model_path, status, an FK to project, and timestamps. Alembic migration to create it.
- The create endpoint. POST /train with the per-project version increment. Kicks off a background job.
- The list and detail endpoints. Boring CRUD. Renders the registry UI.
- The compare endpoint with dynamic metric keys. This is the one that pays for itself within the first two trained models.
- The download endpoint. FileResponse with a self-describing filename.
- A deploy/promote field. An is_deployed boolean or a separate Deployment table — depends on whether you ever need history. Start with the boolean; a sketch follows below.
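A minimal sketch of the boolean flavor of promotion, reusing the assumed get_db / get_current_user dependencies and an is_deployed column added to TrainingJob; the inference service simply reads whichever row in the project has the flag set:

from fastapi import Depends, HTTPException

@router.post("/{project_id}/jobs/{job_id}/deploy")
def deploy_model(
    project_id: str,
    job_id: str,
    db: Session = Depends(get_db),
    current_user=Depends(get_current_user),
):
    project = _get_owned_project(db, project_id, current_user)
    # Demote whatever is currently deployed in this project, then promote the target.
    db.query(TrainingJob).filter(
        TrainingJob.project_id == project.id,
        TrainingJob.is_deployed.is_(True),
    ).update({"is_deployed": False})
    job = db.query(TrainingJob).filter(
        TrainingJob.id == job_id,
        TrainingJob.project_id == project.id,
        TrainingJob.status == "completed",
    ).first()
    if not job:
        raise HTTPException(status_code=404, detail="No completed job with that id")
    job.is_deployed = True
    db.commit()
    return {"deployed_version": job.version}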
Total: roughly 200 lines of FastAPI, one Alembic migration, and the comparison endpoint that does the dynamic-key trick. The whole thing is reviewable in one sitting and runs on the same Postgres the rest of the application uses.
That is what a model registry can be when the team stops outsourcing the decision to a tool category. Most of the time, that is what it should be.