MLOps Part 1: Model Deployment and Serving

Learn the fundamentals of deploying and serving machine learning models in production environments, including deployment strategies, serving architectures, and best practices.

Introduction to Model Deployment

Deploying a machine learning model to production is one of the most critical steps in the ML lifecycle. It's where your model transitions from development to delivering real business value.

What You'll Learn

In this module, you'll understand:

  • Deployment fundamentals: What it means to deploy an ML model
  • Serving architectures: Different ways to serve predictions
  • Deployment strategies: How to safely roll out models
  • Best practices: Industry standards for production ML systems

Why Model Deployment Matters

Many great ML models never make it to production. According to industry surveys:

  • 87% of ML projects never make it past the prototype stage
  • The primary blocker is deployment complexity
  • Production ML requires skills beyond model training

Let's bridge that gap!

---PAGE---

Understanding Model Serving

Model serving is the process of making your trained model available to make predictions on new data in a production environment.

Key Concepts

Inference: The process of using a trained model to make predictions on new, unseen data.

Serving: The infrastructure and systems that host your model and handle inference requests.

Common Serving Patterns

1. Batch Prediction

  • Process large volumes of data at scheduled intervals
  • Examples: Nightly recommendation updates, weekly churn predictions
  • Pros: Simple, efficient for large datasets
  • Cons: Not real-time, can't handle immediate requests
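The core of the batch pattern can be sketched without any framework. In this illustrative sketch, `score` is a stand-in for a real model's `predict`; the part that matters is the chunked iteration, which keeps memory bounded while scoring a large dataset on a schedule.

```python
from typing import Iterable, Iterator, List

def score(batch: List[List[float]]) -> List[float]:
    # Stand-in for model.predict; here it just sums the features
    return [sum(features) for features in batch]

def batch_predict(rows: Iterable[List[float]], batch_size: int = 1000) -> Iterator[float]:
    """Yield predictions for all rows, scoring batch_size rows at a time."""
    batch: List[List[float]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from score(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield from score(batch)

# Example: a scheduled job would stream rows from storage instead of a list
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
predictions = list(batch_predict(rows, batch_size=2))
```

In a real nightly job, `rows` would come from a warehouse query and the predictions would be written back to a table rather than collected in memory.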

2. Real-time (Online) Prediction

  • Respond to individual requests instantly
  • Examples: Fraud detection, content recommendation at click-time
  • Pros: Immediate results, personalized responses
  • Cons: Higher infrastructure costs, latency requirements

3. Streaming Prediction

  • Continuous processing of streaming data
  • Examples: Anomaly detection in sensor data, real-time monitoring
  • Pros: Low latency, handles continuous data
  • Cons: Complex infrastructure, harder to debug
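To make the contrast with batch scoring concrete, here is a minimal, framework-free sketch of the streaming pattern: state is carried across events and each event is scored the moment it arrives. The 3x-running-mean anomaly rule is purely illustrative.

```python
def stream_scores(events):
    """Flag each event as anomalous if it exceeds 3x the running mean."""
    total, count = 0.0, 0
    for value in events:
        # Use the running mean of everything seen so far as the baseline
        mean = total / count if count else value
        yield value > 3 * mean  # emit a decision immediately, per event
        total += value
        count += 1

# Example: a sensor stream where the final reading spikes
flags = list(stream_scores([1.0, 1.2, 0.9, 10.0]))
```

A production system would wrap this logic in a stream processor (Kafka consumer, Flink job, etc.), but the shape is the same: per-event scoring with persistent state.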

---PAGE---

Deployment Architectures

REST API Deployment

The most common pattern for serving ML models is through REST APIs.

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load model at startup
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = data.get('features')
    if features is None:
        return jsonify({'error': "Missing 'features' field"}), 400

    # Make prediction
    prediction = model.predict([features])

    return jsonify({
        'prediction': prediction.tolist(),
        'model_version': '1.0.0'
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Key Components:

  • Endpoint: /predict accepts POST requests
  • Input validation: Check feature format and types
  • Model inference: Call model.predict()
  • Response formatting: Return JSON with prediction

Containerized Deployment

Containers ensure consistency across environments.

FROM python:3.9-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY model.pkl .
COPY app.py .

# Expose port
EXPOSE 5000

# Run application
CMD ["python", "app.py"]

Benefits:

  • Reproducible environments
  • Easy scaling (multiple containers)
  • Platform independent
  • Version control for entire stack

---PAGE---

Deployment Strategies

1. Blue-Green Deployment

Run two identical production environments (Blue and Green). One serves live traffic while the other is idle.

Process:

  1. Blue environment serves production traffic
  2. Deploy new model to Green environment
  3. Test Green environment thoroughly
  4. Switch traffic from Blue to Green
  5. Keep Blue as rollback option
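The "switch traffic" step can be modeled as a single atomic pointer swap at the load balancer. This `BlueGreenRouter` is a hypothetical in-process sketch (all names illustrative), showing why cutover and rollback are instant: only the `active` label changes, and the idle environment stays warm.

```python
import threading

class BlueGreenRouter:
    def __init__(self, blue, green):
        self.envs = {"blue": blue, "green": green}  # two identical environments
        self.active = "blue"                        # blue serves live traffic
        self._lock = threading.Lock()

    def handle(self, request):
        # Route every request to whichever environment is currently active
        return self.envs[self.active](request)

    def switch(self, target: str):
        # Instant cutover; the previous environment is kept for rollback
        with self._lock:
            self.active = target

# Example: cut over to green, then roll back to blue
router = BlueGreenRouter(blue=lambda r: f"model-v1:{r}",
                         green=lambda r: f"model-v2:{r}")
```

In practice the swap happens in a load balancer or service mesh rather than in application code, but the semantics are the same: one label flips, both environments keep running.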

Advantages:

  • Zero downtime
  • Easy rollback
  • Full testing before switch

Disadvantages:

  • 2x infrastructure cost
  • Requires load balancer

2. Canary Deployment

Gradually roll out the new model to a small percentage of users.

Process:

  1. Deploy new model alongside old model
  2. Route 5% of traffic to new model
  3. Monitor metrics (accuracy, latency, errors)
  4. Gradually increase to 25%, 50%, 100%
  5. Rollback if issues detected
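The routing in step 2 is commonly implemented by hashing a stable request key (such as a user id) into a bucket, so each user consistently sees the same model as the rollout percentage grows. A minimal sketch, with illustrative names:

```python
import hashlib

def choose_model(user_id: str, canary_percent: int) -> str:
    """Route user_id to 'canary' or 'stable' based on a stable hash bucket."""
    # Hash into one of 100 buckets; the same user always lands in the same
    # bucket, so raising canary_percent only adds users, never swaps them
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"

# Example: routing stays sticky as the rollout grows 0% -> 5% -> 25% -> 100%
decisions = [choose_model("user-42", p) for p in (0, 5, 25, 100)]
```

Sticky routing matters for monitoring step 3: if users bounced randomly between models on each request, per-user metrics would mix the two models and mask regressions.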

Advantages:

  • Reduced risk
  • Real-world validation
  • Progressive rollout

Disadvantages:

  • Complex routing logic
  • Longer deployment time
  • Need robust monitoring

3. Shadow Deployment

The new model receives real production traffic, but its predictions aren't returned to users.

Process:

  1. Deploy new model in shadow mode
  2. Send copy of production traffic to new model
  3. Compare predictions with current model
  4. Collect metrics without affecting users
  5. Promote to production when confident
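The steps above can be sketched in a few lines. The two predict functions below are illustrative stubs standing in for the live and candidate models; the point is the control flow: the shadow model sees the same input, disagreements are recorded for offline analysis, and only the primary result reaches the user.

```python
from typing import List

def primary_predict(features: List[float]) -> int:
    return 1 if sum(features) > 10 else 0  # stand-in for the live model

def shadow_predict(features: List[float]) -> int:
    return 1 if sum(features) > 8 else 0   # stand-in for the candidate model

disagreements = []  # collected for offline comparison

def handle_request(features: List[float]) -> int:
    result = primary_predict(features)  # user-facing answer
    # In production the shadow call would run asynchronously so it cannot
    # add latency; it is inline here for clarity
    shadow = shadow_predict(features)
    if shadow != result:
        disagreements.append((features, result, shadow))
    return result  # the shadow result never reaches the user
```

Analyzing `disagreements` offline tells you where the candidate model diverges from the current one before any user is affected.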

Advantages:

  • Zero user impact
  • Real production data
  • Side-by-side comparison

Disadvantages:

  • Requires duplicate infrastructure
  • Only validates predictions, not full system

---PAGE---

Model Serving Frameworks

Several frameworks simplify model deployment:

TensorFlow Serving

Designed for serving TensorFlow models at scale.

# Save model in SavedModel format
import tensorflow as tf

model.save('my_model/1/')  # Version 1

# Start TensorFlow Serving
# docker run -p 8501:8501 \
#   --mount type=bind,source=/path/to/my_model,target=/models/my_model \
#   -e MODEL_NAME=my_model \
#   tensorflow/serving

# Make prediction request
import requests
import json

data = json.dumps({
    "signature_name": "serving_default",
    "instances": [[5.0, 2.0, 3.5, 1.0]]
})

response = requests.post(
    'http://localhost:8501/v1/models/my_model:predict',
    data=data
)

Features:

  • High performance (C++ core)
  • Model versioning
  • Batching support
  • gRPC and REST APIs

FastAPI for ML

Modern Python framework optimized for APIs.

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load('model.pkl')

class PredictionInput(BaseModel):
    features: list[float]

class PredictionOutput(BaseModel):
    prediction: float
    confidence: float

@app.post("/predict", response_model=PredictionOutput)
async def predict(input_data: PredictionInput):
    prediction = model.predict([input_data.features])[0]
    # Assumes a classifier that exposes predict_proba
    confidence = model.predict_proba([input_data.features]).max()

    # Cast numpy scalars to plain floats for JSON serialization
    return PredictionOutput(
        prediction=float(prediction),
        confidence=float(confidence)
    )

Benefits:

  • Automatic API documentation
  • Type validation with Pydantic
  • Async support
  • High performance

Seldon Core

Kubernetes-native ML deployment platform.

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
spec:
  predictors:
  - name: default
    replicas: 3
    componentSpecs:
    - spec:
        containers:
        - name: classifier
          image: my-model:1.0.0
    graph:
      name: classifier
      type: MODEL

Features:

  • Auto-scaling
  • A/B testing built-in
  • Model monitoring
  • Multiple model servers (TensorFlow, PyTorch, Scikit-learn)

---PAGE---

Best Practices for Production ML

1. Model Versioning

Always version your models:

import mlflow

# Log model with version
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="fraud_detector"
    )
    mlflow.log_param("algorithm", "random_forest")
    mlflow.log_metric("accuracy", 0.95)

2. Input Validation

Validate all inputs before inference:

from pydantic import BaseModel, validator

class InputData(BaseModel):
    age: int
    income: float

    @validator('age')
    def age_must_be_positive(cls, v):
        if v < 0 or v > 120:
            raise ValueError('Age must be between 0 and 120')
        return v

    @validator('income')
    def income_must_be_reasonable(cls, v):
        if v < 0:
            raise ValueError('Income cannot be negative')
        return v

3. Monitoring and Logging

Track key metrics:

import logging
from prometheus_client import Counter, Histogram

# Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Metrics
prediction_counter = Counter('predictions_total', 'Total predictions')
prediction_latency = Histogram('prediction_latency_seconds', 'Prediction latency')

@app.post("/predict")
async def predict(data: InputData):
    with prediction_latency.time():
        result = model.predict([list(data.dict().values())])
        prediction_counter.inc()

    logger.info(f"Prediction made: {result}")
    return {"prediction": result.tolist()}

4. Error Handling

Gracefully handle errors:

from fastapi import HTTPException

@app.post("/predict")
async def predict(data: InputData):
    try:
        prediction = model.predict([data.features])
        return {"prediction": prediction.tolist()}
    except ValueError as e:
        logger.error(f"Invalid input: {e}")
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        logger.error(f"Prediction failed: {e}")
        raise HTTPException(status_code=500, detail="Internal server error")

5. Health Checks

Implement health endpoints:

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "model_loaded": model is not None,
        "version": "1.0.0"
    }

@app.get("/ready")
async def readiness_check():
    # Check if model is loaded and ready
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {"status": "ready"}

---PAGE---

Performance Optimization

1. Batching Requests

Process multiple predictions together:

from fastapi import FastAPI
from typing import List

@app.post("/predict/batch")
async def predict_batch(inputs: List[InputData]):
    features = [list(item.dict().values()) for item in inputs]
    predictions = model.predict(features)

    return {
        "predictions": predictions.tolist(),
        "count": len(predictions)
    }

Benefits:

  • Better GPU/CPU utilization
  • Reduced overhead
  • Higher throughput

2. Model Caching

Cache frequently requested predictions:

import hashlib
import json

# Simple in-memory cache (unbounded; use an LRU or TTL cache in production)
prediction_cache = {}

def get_cache_key(features):
    return hashlib.md5(json.dumps(features).encode()).hexdigest()

@app.post("/predict")
async def predict(data: InputData):
    cache_key = get_cache_key(data.features)

    if cache_key in prediction_cache:
        return prediction_cache[cache_key]

    prediction = model.predict([data.features])
    result = {"prediction": prediction.tolist()}

    prediction_cache[cache_key] = result
    return result

3. Async Processing

For long-running predictions:

from fastapi import BackgroundTasks
import uuid

jobs = {}

@app.post("/predict/async")
async def predict_async(data: InputData, background_tasks: BackgroundTasks):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending"}

    background_tasks.add_task(run_prediction, job_id, data)

    return {"job_id": job_id}

def run_prediction(job_id: str, data: InputData):
    try:
        prediction = model.predict([data.features])
        jobs[job_id] = {
            "status": "completed",
            "prediction": prediction.tolist()
        }
    except Exception as e:
        jobs[job_id] = {"status": "failed", "error": str(e)}

@app.get("/predict/async/{job_id}")
async def get_prediction_result(job_id: str):
    if job_id not in jobs:
        raise HTTPException(status_code=404, detail="Job not found")
    return jobs[job_id]

---PAGE---

Security Considerations

1. Authentication

Protect your endpoints:

from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    token = credentials.credentials
    if token != "your-secret-token":  # Use proper auth in production
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid authentication credentials"
        )
    return token

@app.post("/predict")
async def predict(data: InputData, token: str = Depends(verify_token)):
    prediction = model.predict([data.features])
    return {"prediction": prediction.tolist()}

2. Rate Limiting

Prevent abuse:

from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/predict")
@limiter.limit("100/minute")
async def predict(request: Request, data: InputData):
    prediction = model.predict([data.features])
    return {"prediction": prediction.tolist()}

3. Input Sanitization

Protect against malicious inputs:

import numpy as np

def sanitize_features(features: list) -> list:
    # expected_feature_count, feature_min, and feature_max are assumed to
    # come from the model's training-time configuration
    # Check for NaN or Inf
    if not all(np.isfinite(features)):
        raise ValueError("Features contain NaN or Inf values")

    # Check feature count
    if len(features) != expected_feature_count:
        raise ValueError(f"Expected {expected_feature_count} features")

    # Clip outliers
    features = np.clip(features, feature_min, feature_max)

    return features.tolist()

---PAGE---

---QUIZ---
TITLE: Model Deployment Knowledge Check
INTRO: Let's test your understanding of model deployment and serving concepts!

Q: Which deployment pattern processes predictions at scheduled intervals rather than in real-time?
A: Batch Prediction
A: Real-time Prediction
A: Streaming Prediction
A: Shadow Deployment
CORRECT: 0
EXPLAIN: Batch prediction processes large volumes of data at scheduled intervals (e.g., nightly or weekly), making it ideal for use cases that don't require immediate results like recommendation updates or periodic churn predictions.

Q: In a Blue-Green deployment strategy, what is the main advantage?
A: Lower infrastructure costs
A: Faster deployment time
A: Zero downtime and easy rollback
A: Automatic scaling
CORRECT: 2
EXPLAIN: Blue-Green deployment maintains two identical environments, allowing you to switch traffic between them instantly, resulting in zero downtime. If issues arise, you can quickly rollback by switching back to the previous environment.

Q: What is the purpose of a Canary deployment?
A: To test the model on synthetic data only
A: To gradually roll out changes to a small percentage of users first
A: To deploy multiple model versions simultaneously
A: To cache prediction results
CORRECT: 1
EXPLAIN: Canary deployment gradually routes a small percentage of production traffic (e.g., 5%) to the new model first. This allows you to validate the new model with real users while limiting risk. You can progressively increase traffic or rollback if issues are detected.

Q: Which of the following is NOT a key component of production-ready model serving?
A: Input validation
A: Health check endpoints
A: Model versioning
A: Training data storage
CORRECT: 3
EXPLAIN: Training data storage is part of the model development phase, not model serving. Production model serving requires input validation (ensure valid data), health checks (monitor service status), and model versioning (track which model is deployed), but doesn't need access to training data.

Q: What is the primary benefit of using Docker containers for model deployment?
A: Faster model training
A: Better model accuracy
A: Reproducible environments across different platforms
A: Automatic hyperparameter tuning
CORRECT: 2
EXPLAIN: Docker containers package your model, code, and all dependencies into a single unit that runs consistently across different environments (development, staging, production). This ensures reproducibility and eliminates "it works on my machine" problems.

Q: In the context of model serving, what does "inference" mean?
A: Training a new model on production data
A: Using a trained model to make predictions on new data
A: Evaluating model performance on test data
A: Updating model parameters
CORRECT: 1
EXPLAIN: Inference is the process of using a trained model to make predictions on new, unseen data in production. It's distinct from training (building the model) and evaluation (testing model performance).

Q: Which metric is most important to monitor for real-time prediction services?
A: Training accuracy
A: Prediction latency
A: Dataset size
A: Number of features
CORRECT: 1
EXPLAIN: For real-time prediction services, latency (response time) is critical because users expect immediate results. While accuracy is important, if your service is too slow, it won't meet real-time requirements. Training metrics and data characteristics are less relevant for production serving.

Q: What is the purpose of implementing rate limiting on a model serving endpoint?
A: To improve model accuracy
A: To prevent abuse and control infrastructure costs
A: To speed up inference
A: To cache predictions
CORRECT: 1
EXPLAIN: Rate limiting restricts how many requests a client can make within a time window. This prevents abuse (intentional or accidental), protects your infrastructure from overload, and helps control costs by limiting resource usage.
---END-QUIZ---

---PAGE---

Key Takeaways

Congratulations! You've completed Part 1 of the MLOps series. Here's what you learned:

Core Concepts

Model Serving: The process of making trained models available for predictions in production

Serving Patterns: Batch, real-time, and streaming prediction approaches

Deployment Strategies: Blue-green, canary, and shadow deployments for safe rollouts

Technical Skills

REST APIs: Building prediction endpoints with Flask and FastAPI

Containerization: Using Docker for reproducible deployments

Frameworks: TensorFlow Serving, FastAPI, and Seldon Core

Best Practices

Monitoring: Logging predictions and tracking performance metrics

Validation: Checking inputs before inference

Security: Authentication, rate limiting, and input sanitization

Performance: Batching, caching, and async processing

Next Steps

In Part 2: Model Monitoring and Observability, you'll learn:

  • How to detect model drift in production
  • Monitoring prediction quality and data quality
  • Setting up alerts for model degradation
  • Building dashboards for ML systems

In Part 3: CI/CD for ML, you'll discover:

  • Automated testing for ML models
  • Building ML pipelines
  • Continuous training and deployment
  • Version control for models and data

In Part 4: Scaling ML Systems, you'll explore:

  • Horizontal and vertical scaling strategies
  • Load balancing for ML services
  • Distributed inference
  • Cost optimization


Keep practicing, and see you in Part 2!

MLOps Part 1: Model Deployment and Serving | Software Engineer Blog