MLOps Part 1: Model Deployment and Serving

Learn the fundamentals of deploying and serving machine learning models in production environments, including deployment strategies, serving architectures, and best practices.

Introduction to Model Deployment

Deploying a machine learning model to production is one of the most critical steps in the ML lifecycle. It's where your model transitions from development to delivering real business value.

What You'll Learn

In this module, you'll understand:

  • Deployment fundamentals: What it means to deploy an ML model
  • Serving architectures: Different ways to serve predictions
  • Deployment strategies: How to safely roll out models
  • Best practices: Industry standards for production ML systems

Why Model Deployment Matters

Many great ML models never make it to production. According to industry surveys:

  • 87% of ML projects never make it past the prototype stage
  • The primary blocker is deployment complexity
  • Production ML requires skills beyond model training

Let's bridge that gap!

---PAGE---

Understanding Model Serving

Model serving is the process of making your trained model available to make predictions on new data in a production environment.

Key Concepts

Inference: The process of using a trained model to make predictions on new, unseen data.

Serving: The infrastructure and systems that host your model and handle inference requests.

Common Serving Patterns

1. Batch Prediction

  • Process large volumes of data at scheduled intervals
  • Examples: Nightly recommendation updates, weekly churn predictions
  • Pros: Simple, efficient for large datasets
  • Cons: Not real-time, can't handle immediate requests
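The core of the batch pattern can be sketched without any framework. In this illustrative sketch, `score` is a stand-in for a real model's `predict`; the part that matters is the chunked iteration, which keeps memory bounded while scoring a large dataset on a schedule.

```python
from typing import Iterable, Iterator, List

def score(batch: List[List[float]]) -> List[float]:
    # Stand-in for model.predict; here it just sums the features
    return [sum(features) for features in batch]

def batch_predict(rows: Iterable[List[float]], batch_size: int = 1000) -> Iterator[float]:
    """Yield predictions for all rows, scoring batch_size rows at a time."""
    batch: List[List[float]] = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from score(batch)
            batch = []
    if batch:  # flush the final partial batch
        yield from score(batch)

# Example: a scheduled job would stream rows from storage instead of a list
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
predictions = list(batch_predict(rows, batch_size=2))
```

In a real nightly job, `rows` would come from a warehouse query and the predictions would be written back to a table rather than collected in memory.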

2. Real-time (Online) Prediction

  • Respond to individual requests instantly
  • Examples: Fraud detection, content recommendation at click-time
  • Pros: Immediate results, personalized responses
  • Cons: Higher infrastructure costs, latency requirements

3. Streaming Prediction

  • Continuous processing of streaming data
  • Examples: Anomaly detection in sensor data, real-time monitoring
  • Pros: Low latency, handles continuous data
  • Cons: Complex infrastructure, harder to debug
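To make the contrast with batch scoring concrete, here is a minimal, framework-free sketch of the streaming pattern: state is carried across events and each event is scored the moment it arrives. The 3x-running-mean anomaly rule is purely illustrative.

```python
def stream_scores(events):
    """Flag each event as anomalous if it exceeds 3x the running mean."""
    total, count = 0.0, 0
    for value in events:
        # Use the running mean of everything seen so far as the baseline
        mean = total / count if count else value
        yield value > 3 * mean  # emit a decision immediately, per event
        total += value
        count += 1

# Example: a sensor stream where the final reading spikes
flags = list(stream_scores([1.0, 1.2, 0.9, 10.0]))
```

A production system would wrap this logic in a stream processor (Kafka consumer, Flink job, etc.), but the shape is the same: per-event scoring with persistent state.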

---PAGE---

Deployment Architectures

REST API Deployment

The most common pattern for serving ML models is through REST APIs.

from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load model at startup
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = data.get('features')
    if features is None:
        return jsonify({'error': "Missing 'features' field"}), 400

    # Make prediction
    prediction = model.predict([features])

    return jsonify({
        'prediction': prediction.tolist(),
        'model_version': '1.0.0'
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)

Key Components:

  • Endpoint: /predict accepts POST requests
  • Input validation: Check feature format and types
  • Model inference: Call model.predict()
  • Response formatting: Return JSON with prediction

Containerized Deployment

Containers ensure consistency across environments.

FROM python:3.9-slim

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy model and code
COPY model.pkl .
COPY app.py .

# Expose port
EXPOSE 5000

# Run application
CMD ["python", "app.py"]

Benefits:

  • Reproducible environments
  • Easy scaling (multiple containers)
  • Platform independent
  • Version control for entire stack

---PAGE---

Deployment Strategies

1. Blue-Green Deployment

Run two identical production environments (Blue and Green). One serves live traffic while the other is idle.

Process:

  1. Blue environment serves production traffic
  2. Deploy new model to Green environment
  3. Test Green environment thoroughly
  4. Switch traffic from Blue to Green
  5. Keep Blue as rollback option
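The "switch traffic" step can be modeled as a single atomic pointer swap at the load balancer. This `BlueGreenRouter` is a hypothetical in-process sketch (all names illustrative), showing why cutover and rollback are instant: only the `active` label changes, and the idle environment stays warm.

```python
import threading

class BlueGreenRouter:
    def __init__(self, blue, green):
        self.envs = {"blue": blue, "green": green}  # two identical environments
        self.active = "blue"                        # blue serves live traffic
        self._lock = threading.Lock()

    def handle(self, request):
        # Route every request to whichever environment is currently active
        return self.envs[self.active](request)

    def switch(self, target: str):
        # Instant cutover; the previous environment is kept for rollback
        with self._lock:
            self.active = target

# Example: cut over to green, then roll back to blue
router = BlueGreenRouter(blue=lambda r: f"model-v1:{r}",
                         green=lambda r: f"model-v2:{r}")
```

In practice the swap happens in a load balancer or service mesh rather than in application code, but the semantics are the same: one label flips, both environments keep running.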

Advantages:

  • Zero downtime
  • Easy rollback
  • Full testing before switch

Disadvantages:

  • 2x infrastructure cost
  • Requires load balancer

2. Canary Deployment

Gradually roll out the new model to a small percentage of users.

Process:

  1. Deploy new model alongside old model
  2. Route 5% of traffic to new model
  3. Monitor metrics (accuracy, latency, errors)
  4. Gradually increase to 25%, 50%, 100%
  5. Rollback if issues detected
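The routing in step 2 is commonly implemented by hashing a stable request key (such as a user id) into a bucket, so each user consistently sees the same model as the rollout percentage grows. A minimal sketch, with illustrative names:

```python
import hashlib

def choose_model(user_id: str, canary_percent: int) -> str:
    """Route user_id to 'canary' or 'stable' based on a stable hash bucket."""
    # Hash into one of 100 buckets; the same user always lands in the same
    # bucket, so raising canary_percent only adds users, never swaps them
    bucket = int(hashlib.md5(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"

# Example: routing stays sticky as the rollout grows 0% -> 5% -> 25% -> 100%
decisions = [choose_model("user-42", p) for p in (0, 5, 25, 100)]
```

Sticky routing matters for monitoring step 3: if users bounced randomly between models on each request, per-user metrics would mix the two models and mask regressions.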

Advantages:

  • Reduced risk
  • Real-world validation
  • Progressive rollout

Disadvantages:

  • Complex routing logic
  • Longer deployment time
  • Need robust monitoring

3. Shadow Deployment

The new model receives real production traffic, but its predictions aren't returned to users.

Process:

  1. Deploy new model in shadow mode
  2. Send copy of production traffic to new model
  3. Compare predictions with current model
  4. Collect metrics without affecting users
  5. Promote to production when confident
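The steps above can be sketched in a few lines. The two predict functions below are illustrative stubs standing in for the live and candidate models; the point is the control flow: the shadow model sees the same input, disagreements are recorded for offline analysis, and only the primary result reaches the user.

```python
from typing import List

def primary_predict(features: List[float]) -> int:
    return 1 if sum(features) > 10 else 0  # stand-in for the live model

def shadow_predict(features: List[float]) -> int:
    return 1 if sum(features) > 8 else 0   # stand-in for the candidate model

disagreements = []  # collected for offline comparison

def handle_request(features: List[float]) -> int:
    result = primary_predict(features)  # user-facing answer
    # In production the shadow call would run asynchronously so it cannot
    # add latency; it is inline here for clarity
    shadow = shadow_predict(features)
    if shadow != result:
        disagreements.append((features, result, shadow))
    return result  # the shadow result never reaches the user
```

Analyzing `disagreements` offline tells you where the candidate model diverges from the current one before any user is affected.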

Advantages:

  • Zero user impact
  • Real production data
  • Side-by-side comparison

Disadvantages:

  • Requires duplicate infrastructure
  • Only validates predictions, not full system

---PAGE---

Model Serving Frameworks

Several frameworks simplify model deployment:

TensorFlow Serving

Designed for serving TensorFlow models at scale.

# Save model in SavedModel format
import tensorflow as tf

model.save('my_model/1/')  # Version 1

# Start TensorFlow Serving
# docker run -p 8501:8501 \
#   --mount type=bind,source=/path/to/my_model,target=/models/my_model \
#   -e MODEL_NAME=my_model \
#   tensorflow/serving

# Make prediction request
import requests
import json

data = json.dumps({
    "signature_name": "serving_default",
    "instances": [[5.0, 2.0, 3.5, 1.0]]
})

response = requests.post(
    'http://localhost:8501/v1/models/my_model:predict',
    data=data
)

Features:

  • High performance (C++ core)
  • Model versioning
  • Batching support
  • gRPC and REST APIs

FastAPI for ML

Modern Python framework optimized for APIs.

from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load('model.pkl')

class PredictionInput(BaseModel):
    features: list[float]

class PredictionOutput(BaseModel):
    prediction: float
    confidence: float

@app.post("/predict", response_model=PredictionOutput)
async def predict(input_data: PredictionInput):
    prediction = model.predict([input_data.features])[0]
    # Assumes a classifier that exposes predict_proba
    confidence = model.predict_proba([input_data.features]).max()

    # Cast numpy scalars to plain floats for JSON serialization
    return PredictionOutput(
        prediction=float(prediction),
        confidence=float(confidence)
    )

Benefits:

  • Automatic API documentation
  • Type validation with Pydantic
  • Async support
  • High performance

Seldon Core

Kubernetes-native ML deployment platform.

apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
spec:
  predictors:
  - name: default
    replicas: 3
    componentSpecs:
    - spec:
        containers:
        - name: classifier
          image: my-model:1.0.0
    graph:
      name: classifier
      type: MODEL

Features:

  • Auto-scaling
  • A/B testing built-in
  • Model monitoring
  • Multiple model servers (TensorFlow, PyTorch, Scikit-learn)

---PAGE---

Best Practices for Production ML

1. Model Versioning

Always version your models:

import mlflow

# Log model with version
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="fraud_detector"
    )
    mlflow.log_param("algorithm", "random_forest")
    mlflow.log_metric("accuracy", 0.95)

2. Input Validation

Validate all inputs before inference:

from pydantic import BaseModel, validator

class InputData(BaseModel):
    age: int
    income: float

    @validator('age')
    def age_must_be_positive(cls, v):
        if v < 0 or v > 120:
            raise ValueError('Age must be between 0 and 120')
        return v

    @validator('income')
    def income_must_be_reasonable(cls, v):
        if v < 0:
            raise ValueError('Income cannot be negative')
        return v

3. Monitoring and Logging

Track key metrics:

import logging
from prometheus_client import Counter, Histogram

# Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Metrics
prediction_counter = Counter('predictions_total', 'Total predictions')
prediction_latency = Histogram('prediction_latency_seconds', 'Prediction latency')

@app.post("/predict")
async def predict(data: InputData):
    with prediction_latency.time():
        result = model.predict([list(data.dict().values())])
        prediction_counter.inc()

    logger.info(f"Prediction made: {result}")
    return {"prediction": result.tolist()}

4. Error Handling

Gracefully handle errors:

from fastapi import HTTPException

@app.post("/predict")
async def predict(data: InputData):
    try:
        prediction = model.predict([data.features])
        return {"prediction": prediction.tolist()}
    except ValueError as e:
        logger.error(f"Invalid input: {e}")
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        logger.error(f"Prediction failed: {e}")
        raise HTTPException(status_code=500, detail="Internal server error")

5. Health Checks

Implement health endpoints:

@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "model_loaded": model is not None,
        "version": "1.0.0"
    }

@app.get("/ready")
async def readiness_check():
    # Check if model is loaded and ready
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {"status": "ready"}

---PAGE---

Performance Optimization

1. Batching Requests

Process multiple predictions together:

from fastapi import FastAPI
from typing import List

@app.post("/predict/batch")
async def predict_batch(inputs: List[InputData]):
    features = [list(item.dict().values()) for item in inputs]
    predictions = model.predict(features)

    return {
        "predictions": predictions.tolist(),
        "count": len(predictions)
    }

Benefits:

  • Better GPU/CPU utilization
  • Reduced overhead
  • Higher throughput

2. Model Caching

Cache frequently requested predictions:

import hashlib
import json

# Simple in-memory cache (unbounded; use an LRU or TTL cache in production)
prediction_cache = {}

def get_cache_key(features):
    return hashlib.md5(json.dumps(features).encode()).hexdigest()

@app.post("/predict")
async def predict(data: InputData):
    cache_key = get_cache_key(data.features)

    if cache_key in prediction_cache:
        return prediction_cache[cache_key]

    prediction = model.predict([data.features])
    result = {"prediction": prediction.tolist()}

    prediction_cache[cache_key] = result
    return result

3. Async Processing

For long-running predictions:

from fastapi import BackgroundTasks
import uuid

jobs = {}

@app.post("/predict/async")
async def predict_async(data: InputData, background_tasks: BackgroundTasks):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending"}

    background_tasks.add_task(run_prediction, job_id, data)

    return {"job_id": job_id}

def run_prediction(job_id: str, data: InputData):
    try:
        prediction = model.predict([data.features])
        jobs[job_id] = {
            "status": "completed",
            "prediction": prediction.tolist()
        }
    except Exception as e:
        jobs[job_id] = {"status": "failed", "error": str(e)}

@app.get("/predict/async/{job_id}")
async def get_prediction_result(job_id: str):
    if job_id not in jobs:
        raise HTTPException(status_code=404, detail="Job not found")
    return jobs[job_id]

---PAGE---

Security Considerations

1. Authentication

Protect your endpoints:

from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    token = credentials.credentials
    if token != "your-secret-token":  # Use proper auth in production
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid authentication credentials"
        )
    return token

@app.post("/predict")
async def predict(data: InputData, token: str = Depends(verify_token)):
    prediction = model.predict([data.features])
    return {"prediction": prediction.tolist()}

2. Rate Limiting

Prevent abuse:

from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/predict")
@limiter.limit("100/minute")
async def predict(request: Request, data: InputData):
    prediction = model.predict([data.features])
    return {"prediction": prediction.tolist()}

3. Input Sanitization

Protect against malicious inputs:

import numpy as np

def sanitize_features(features: list) -> list:
    # expected_feature_count, feature_min, and feature_max are assumed to
    # come from the model's training-time configuration
    # Check for NaN or Inf
    if not all(np.isfinite(features)):
        raise ValueError("Features contain NaN or Inf values")

    # Check feature count
    if len(features) != expected_feature_count:
        raise ValueError(f"Expected {expected_feature_count} features")

    # Clip outliers
    features = np.clip(features, feature_min, feature_max)

    return features.tolist()

---PAGE---

---QUIZ---
TITLE: Model Deployment Knowledge Check
INTRO: Let's test your understanding of model deployment and serving concepts!

Q: Which deployment pattern processes predictions at scheduled intervals rather than in real-time?
A: Batch Prediction
A: Real-time Prediction
A: Streaming Prediction
A: Shadow Deployment
CORRECT: 0
EXPLAIN: Batch prediction processes large volumes of data at scheduled intervals (e.g., nightly or weekly), making it ideal for use cases that don't require immediate results like recommendation updates or periodic churn predictions.

Q: In a Blue-Green deployment strategy, what is the main advantage?
A: Lower infrastructure costs
A: Faster deployment time
A: Zero downtime and easy rollback
A: Automatic scaling
CORRECT: 2
EXPLAIN: Blue-Green deployment maintains two identical environments, allowing you to switch traffic between them instantly, resulting in zero downtime. If issues arise, you can quickly rollback by switching back to the previous environment.

Q: What is the purpose of a Canary deployment?
A: To test the model on synthetic data only
A: To gradually roll out changes to a small percentage of users first
A: To deploy multiple model versions simultaneously
A: To cache prediction results
CORRECT: 1
EXPLAIN: Canary deployment gradually routes a small percentage of production traffic (e.g., 5%) to the new model first. This allows you to validate the new model with real users while limiting risk. You can progressively increase traffic or rollback if issues are detected.

Q: Which of the following is NOT a key component of production-ready model serving?
A: Input validation
A: Health check endpoints
A: Model versioning
A: Training data storage
CORRECT: 3
EXPLAIN: Training data storage is part of the model development phase, not model serving. Production model serving requires input validation (ensure valid data), health checks (monitor service status), and model versioning (track which model is deployed), but doesn't need access to training data.

Q: What is the primary benefit of using Docker containers for model deployment?
A: Faster model training
A: Better model accuracy
A: Reproducible environments across different platforms
A: Automatic hyperparameter tuning
CORRECT: 2
EXPLAIN: Docker containers package your model, code, and all dependencies into a single unit that runs consistently across different environments (development, staging, production). This ensures reproducibility and eliminates "it works on my machine" problems.

Q: In the context of model serving, what does "inference" mean?
A: Training a new model on production data
A: Using a trained model to make predictions on new data
A: Evaluating model performance on test data
A: Updating model parameters
CORRECT: 1
EXPLAIN: Inference is the process of using a trained model to make predictions on new, unseen data in production. It's distinct from training (building the model) and evaluation (testing model performance).

Q: Which metric is most important to monitor for real-time prediction services?
A: Training accuracy
A: Prediction latency
A: Dataset size
A: Number of features
CORRECT: 1
EXPLAIN: For real-time prediction services, latency (response time) is critical because users expect immediate results. While accuracy is important, if your service is too slow, it won't meet real-time requirements. Training metrics and data characteristics are less relevant for production serving.

Q: What is the purpose of implementing rate limiting on a model serving endpoint?
A: To improve model accuracy
A: To prevent abuse and control infrastructure costs
A: To speed up inference
A: To cache predictions
CORRECT: 1
EXPLAIN: Rate limiting restricts how many requests a client can make within a time window. This prevents abuse (intentional or accidental), protects your infrastructure from overload, and helps control costs by limiting resource usage.
---END-QUIZ---

---PAGE---

Key Takeaways

Congratulations! You've completed Part 1 of the MLOps series. Here's what you learned:

Core Concepts

Model Serving: The process of making trained models available for predictions in production

Serving Patterns: Batch, real-time, and streaming prediction approaches

Deployment Strategies: Blue-green, canary, and shadow deployments for safe rollouts

Technical Skills

REST APIs: Building prediction endpoints with Flask and FastAPI

Containerization: Using Docker for reproducible deployments

Frameworks: TensorFlow Serving, FastAPI, and Seldon Core

Best Practices

Monitoring: Logging predictions and tracking performance metrics

Validation: Checking inputs before inference

Security: Authentication, rate limiting, and input sanitization

Performance: Batching, caching, and async processing

Next Steps

In Part 2: Model Monitoring and Observability, you'll learn:

  • How to detect model drift in production
  • Monitoring prediction quality and data quality
  • Setting up alerts for model degradation
  • Building dashboards for ML systems

In Part 3: CI/CD for ML, you'll discover:

  • Automated testing for ML models
  • Building ML pipelines
  • Continuous training and deployment
  • Version control for models and data

In Part 4: Scaling ML Systems, you'll explore:

  • Horizontal and vertical scaling strategies
  • Load balancing for ML services
  • Distributed inference
  • Cost optimization


Keep practicing, and see you in Part 2!

MLOps Part 1: Model Deployment and Serving | Software Engineer Blog