MLOps Part 1: Model Deployment and Serving
Learn the fundamentals of deploying and serving machine learning models in production environments, including deployment strategies, serving architectures, and best practices
Introduction to Model Deployment
Deploying a machine learning model to production is one of the most critical steps in the ML lifecycle. It's where your model transitions from development to delivering real business value.
What You'll Learn
In this module, you'll understand:
- Deployment fundamentals: What it means to deploy an ML model
- Serving architectures: Different ways to serve predictions
- Deployment strategies: How to safely roll out models
- Best practices: Industry standards for production ML systems
Why Model Deployment Matters
Many great ML models never make it to production. Industry surveys frequently report that:
- Roughly 87% of ML projects never make it past the prototype stage
- The primary blocker is deployment complexity
- Production ML requires skills beyond model training
Let's bridge that gap!
---PAGE---
Understanding Model Serving
Model serving is the process of making your trained model available to make predictions on new data in a production environment.
Key Concepts
Inference: The process of using a trained model to make predictions on new, unseen data.
Serving: The infrastructure and systems that host your model and handle inference requests.
Common Serving Patterns
1. Batch Prediction
- Process large volumes of data at scheduled intervals
- Examples: Nightly recommendation updates, weekly churn predictions
- Pros: Simple, efficient for large datasets
- Cons: Not real-time, can't handle immediate requests
2. Real-time (Online) Prediction
- Respond to individual requests instantly
- Examples: Fraud detection, content recommendation at click-time
- Pros: Immediate results, personalized responses
- Cons: Higher infrastructure costs, latency requirements
3. Streaming Prediction
- Continuous processing of streaming data
- Examples: Anomaly detection in sensor data, real-time monitoring
- Pros: Low latency, handles continuous data
- Cons: Complex infrastructure, harder to debug
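To make the batch pattern concrete, here is a minimal sketch of a scheduled batch-scoring job. The `ThresholdModel` class and `predict_batch` helper are illustrative stand-ins, not part of any real library; in practice the model would be loaded from a registry and the rows from a data warehouse.

```python
def predict_batch(model, rows):
    """Score every row in one pass; model is any object with a .predict(features) method."""
    return [model.predict(row) for row in rows]

class ThresholdModel:
    """Toy stand-in for a trained model: flags rows whose feature sum exceeds a threshold."""
    def __init__(self, threshold):
        self.threshold = threshold

    def predict(self, features):
        return 1 if sum(features) > self.threshold else 0

# A nightly job would load rows from storage, score them all at once,
# and write the results back for downstream systems to read.
model = ThresholdModel(threshold=10.0)
rows = [[2.0, 3.0], [8.0, 5.0], [1.0, 0.5]]
scores = predict_batch(model, rows)  # one pass over the whole dataset
```

Because all rows are scored together on a schedule, there is no per-request serving infrastructure, which is why batch prediction is the simplest pattern to operate.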
---PAGE---
Deployment Architectures
REST API Deployment
The most common pattern for serving ML models is through REST APIs.
from flask import Flask, request, jsonify
import joblib

app = Flask(__name__)

# Load model at startup
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    data = request.get_json()
    features = data['features']

    # Make prediction
    prediction = model.predict([features])

    return jsonify({
        'prediction': prediction.tolist(),
        'model_version': '1.0.0'
    })

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=5000)
Key Components:
- Endpoint: /predict accepts POST requests
- Input validation: Check feature format and types
- Model inference: Call model.predict()
- Response formatting: Return JSON with prediction
Containerized Deployment
Containers ensure consistency across environments.
FROM python:3.9-slim
WORKDIR /app
# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy model and code
COPY model.pkl .
COPY app.py .
# Expose port
EXPOSE 5000
# Run application
CMD ["python", "app.py"]
Benefits:
- Reproducible environments
- Easy scaling (multiple containers)
- Platform independent
- Version control for entire stack
---PAGE---
Deployment Strategies
1. Blue-Green Deployment
Run two identical production environments (Blue and Green). One serves live traffic while the other is idle.
Process:
- Blue environment serves production traffic
- Deploy new model to Green environment
- Test Green environment thoroughly
- Switch traffic from Blue to Green
- Keep Blue as rollback option
Advantages:
- Zero downtime
- Easy rollback
- Full testing before switch
Disadvantages:
- 2x infrastructure cost
- Requires load balancer
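The five-step process above can be sketched with a toy router abstraction. The `Router` class and its method names are hypothetical, purely to show the mechanics; a real setup would flip a load-balancer target or a Kubernetes service selector instead.

```python
class Router:
    """Toy blue-green router: two environments, one live at a time."""
    def __init__(self):
        self.environments = {"blue": "model-v1", "green": None}
        self.live = "blue"

    def deploy(self, env, model_version):
        """Stage a new model in the idle environment only."""
        if env == self.live:
            raise ValueError("Never deploy directly to the live environment")
        self.environments[env] = model_version

    def switch(self):
        """Flip live traffic; the old environment stays as the rollback target."""
        idle = "green" if self.live == "blue" else "blue"
        if self.environments[idle] is None:
            raise RuntimeError("Idle environment has no model deployed")
        self.live = idle

router = Router()
router.deploy("green", "model-v2")  # test green thoroughly here
router.switch()                     # green now serves all traffic
# blue still holds model-v1, so rollback is just another switch()
```

Note how the switch is atomic from the client's point of view, which is what gives blue-green its zero-downtime property.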
2. Canary Deployment
Gradually roll out new model to small percentage of users.
Process:
- Deploy new model alongside old model
- Route 5% of traffic to new model
- Monitor metrics (accuracy, latency, errors)
- Gradually increase to 25%, 50%, 100%
- Rollback if issues detected
Advantages:
- Reduced risk
- Real-world validation
- Progressive rollout
Disadvantages:
- Complex routing logic
- Longer deployment time
- Need robust monitoring
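The routing logic behind a canary rollout can be sketched in a few lines. This is an assumed approach, not a specific library's API: hashing a stable request key (such as a user id) keeps each user on the same model version, which makes metrics comparable across the rollout.

```python
import hashlib

def route(user_id: str, canary_percent: int) -> str:
    """Send roughly canary_percent of users to the new model, deterministically per user."""
    # Hash the user id into a stable bucket from 0 to 99
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 100
    return "canary" if bucket < canary_percent else "stable"

# Increasing the rollout from 5% to 25% to 100% only changes the threshold;
# users already on the canary stay on it.
assignments = [route(f"user-{i}", 5) for i in range(1000)]
canary_share = assignments.count("canary") / len(assignments)
```

Deterministic bucketing also makes rollback clean: dropping the percentage back to 0 returns every user to the stable model.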
3. Shadow Deployment
New model receives real traffic but predictions aren't used.
Process:
- Deploy new model in shadow mode
- Send copy of production traffic to new model
- Compare predictions with current model
- Collect metrics without affecting users
- Promote to production when confident
Advantages:
- Zero user impact
- Real production data
- Side-by-side comparison
Disadvantages:
- Requires duplicate infrastructure
- Only validates predictions, not full system
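A shadow deployment can be sketched as follows. The function and model names are illustrative; the key idea is that the shadow model sees every request, but only the primary's prediction reaches the user, and a shadow failure must never break the response.

```python
comparison_log = []

def serve(request, primary, shadow):
    """Return the primary prediction; record the shadow prediction for offline comparison."""
    primary_pred = primary(request)
    try:
        shadow_pred = shadow(request)
        comparison_log.append({
            "request": request,
            "primary": primary_pred,
            "shadow": shadow_pred,
            "agree": primary_pred == shadow_pred,
        })
    except Exception as exc:
        # A shadow failure must never affect the user-facing response
        comparison_log.append({"request": request, "error": str(exc)})
    return primary_pred

# Toy stand-ins for the current and candidate models
old_model = lambda x: x > 5
new_model = lambda x: x > 4

result = serve(6, old_model, new_model)  # user sees old_model's answer
serve(4.5, old_model, new_model)         # disagreement is logged, user unaffected
```

Analyzing `comparison_log` offline shows where the models disagree on real traffic, which is exactly the evidence needed before promoting the new model.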
---PAGE---
Model Serving Frameworks
Several frameworks simplify model deployment:
TensorFlow Serving
Designed for serving TensorFlow models at scale.
# Save model in SavedModel format
import tensorflow as tf

model.save('my_model/1/')  # Version 1

# Start TensorFlow Serving
# docker run -p 8501:8501 \
#   --mount type=bind,source=/path/to/my_model,target=/models/my_model \
#   -e MODEL_NAME=my_model \
#   tensorflow/serving

# Make prediction request
import requests
import json

data = json.dumps({
    "signature_name": "serving_default",
    "instances": [[5.0, 2.0, 3.5, 1.0]]
})
response = requests.post(
    'http://localhost:8501/v1/models/my_model:predict',
    data=data
)
Features:
- High performance (C++ core)
- Model versioning
- Batching support
- gRPC and REST APIs
FastAPI for ML
Modern Python framework optimized for APIs.
from fastapi import FastAPI
from pydantic import BaseModel
import joblib

app = FastAPI()
model = joblib.load('model.pkl')

class PredictionInput(BaseModel):
    features: list[float]

class PredictionOutput(BaseModel):
    prediction: float
    confidence: float

@app.post("/predict", response_model=PredictionOutput)
async def predict(input_data: PredictionInput):
    prediction = model.predict([input_data.features])[0]
    confidence = model.predict_proba([input_data.features]).max()
    return PredictionOutput(
        prediction=prediction,
        confidence=confidence
    )
Benefits:
- Automatic API documentation
- Type validation with Pydantic
- Async support
- High performance
Seldon Core
Kubernetes-native ML deployment platform.
apiVersion: machinelearning.seldon.io/v1
kind: SeldonDeployment
metadata:
  name: iris-model
spec:
  predictors:
    - name: default
      replicas: 3
      componentSpecs:
        - spec:
            containers:
              - name: classifier
                image: my-model:1.0.0
      graph:
        name: classifier
        type: MODEL
Features:
- Auto-scaling
- A/B testing built-in
- Model monitoring
- Multiple model servers (TensorFlow, PyTorch, Scikit-learn)
---PAGE---
Best Practices for Production ML
1. Model Versioning
Always version your models:
import mlflow

# Log model with version
with mlflow.start_run():
    mlflow.sklearn.log_model(
        model,
        "model",
        registered_model_name="fraud_detector"
    )
    mlflow.log_param("algorithm", "random_forest")
    mlflow.log_metric("accuracy", 0.95)
2. Input Validation
Validate all inputs before inference:
from pydantic import BaseModel, validator

class InputData(BaseModel):
    age: int
    income: float

    @validator('age')
    def age_must_be_positive(cls, v):
        if v < 0 or v > 120:
            raise ValueError('Age must be between 0 and 120')
        return v

    @validator('income')
    def income_must_be_reasonable(cls, v):
        if v < 0:
            raise ValueError('Income cannot be negative')
        return v
3. Monitoring and Logging
Track key metrics:
import logging
from prometheus_client import Counter, Histogram

# Logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# Metrics
prediction_counter = Counter('predictions_total', 'Total predictions')
prediction_latency = Histogram('prediction_latency_seconds', 'Prediction latency')

@app.post("/predict")
async def predict(data: InputData):
    with prediction_latency.time():
        # Convert the field values to a list before passing them to the model
        result = model.predict([list(data.dict().values())])
    prediction_counter.inc()
    logger.info(f"Prediction made: {result}")
    return {"prediction": result.tolist()}
4. Error Handling
Gracefully handle errors:
from fastapi import HTTPException

@app.post("/predict")
async def predict(data: InputData):
    try:
        prediction = model.predict([data.features])
        return {"prediction": prediction.tolist()}
    except ValueError as e:
        logger.error(f"Invalid input: {e}")
        raise HTTPException(status_code=400, detail=str(e))
    except Exception as e:
        logger.error(f"Prediction failed: {e}")
        raise HTTPException(status_code=500, detail="Internal server error")
5. Health Checks
Implement health endpoints:
@app.get("/health")
async def health_check():
    return {
        "status": "healthy",
        "model_loaded": model is not None,
        "version": "1.0.0"
    }

@app.get("/ready")
async def readiness_check():
    # Check if model is loaded and ready
    if model is None:
        raise HTTPException(status_code=503, detail="Model not loaded")
    return {"status": "ready"}
---PAGE---
Performance Optimization
1. Batching Requests
Process multiple predictions together:
from typing import List

@app.post("/predict/batch")
async def predict_batch(inputs: List[InputData]):
    features = [list(item.dict().values()) for item in inputs]
    predictions = model.predict(features)
    return {
        "predictions": predictions.tolist(),
        "count": len(predictions)
    }
Benefits:
- Better GPU/CPU utilization
- Reduced overhead
- Higher throughput
2. Model Caching
Cache frequently requested predictions:
import hashlib
import json

# Simple in-memory cache
prediction_cache = {}

def get_cache_key(features):
    return hashlib.md5(json.dumps(features).encode()).hexdigest()

@app.post("/predict")
async def predict(data: InputData):
    cache_key = get_cache_key(data.features)
    if cache_key in prediction_cache:
        return prediction_cache[cache_key]

    prediction = model.predict([data.features])
    result = {"prediction": prediction.tolist()}
    prediction_cache[cache_key] = result
    return result
3. Async Processing
For long-running predictions:
from fastapi import BackgroundTasks
import uuid

jobs = {}

@app.post("/predict/async")
async def predict_async(data: InputData, background_tasks: BackgroundTasks):
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending"}
    background_tasks.add_task(run_prediction, job_id, data)
    return {"job_id": job_id}

def run_prediction(job_id: str, data: InputData):
    try:
        prediction = model.predict([data.features])
        jobs[job_id] = {
            "status": "completed",
            "prediction": prediction.tolist()
        }
    except Exception as e:
        jobs[job_id] = {"status": "failed", "error": str(e)}

@app.get("/predict/async/{job_id}")
async def get_prediction_result(job_id: str):
    if job_id not in jobs:
        raise HTTPException(status_code=404, detail="Job not found")
    return jobs[job_id]
---PAGE---
Security Considerations
1. Authentication
Protect your endpoints:
from fastapi import Depends, HTTPException, status
from fastapi.security import HTTPBearer, HTTPAuthorizationCredentials

security = HTTPBearer()

def verify_token(credentials: HTTPAuthorizationCredentials = Depends(security)):
    token = credentials.credentials
    if token != "your-secret-token":  # Use proper auth in production
        raise HTTPException(
            status_code=status.HTTP_401_UNAUTHORIZED,
            detail="Invalid authentication credentials"
        )
    return token

@app.post("/predict")
async def predict(data: InputData, token: str = Depends(verify_token)):
    prediction = model.predict([data.features])
    return {"prediction": prediction.tolist()}
2. Rate Limiting
Prevent abuse:
from fastapi import Request
from slowapi import Limiter, _rate_limit_exceeded_handler
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, _rate_limit_exceeded_handler)

@app.post("/predict")
@limiter.limit("100/minute")
async def predict(request: Request, data: InputData):
    prediction = model.predict([data.features])
    return {"prediction": prediction.tolist()}
3. Input Sanitization
Protect against malicious inputs:
import numpy as np

# expected_feature_count, feature_min, and feature_max come from the model's
# training configuration and are defined elsewhere in the service
def sanitize_features(features: list) -> list:
    # Check feature count before any numeric checks
    if len(features) != expected_feature_count:
        raise ValueError(f"Expected {expected_feature_count} features")

    # Check for NaN or Inf
    if not all(np.isfinite(features)):
        raise ValueError("Features contain NaN or Inf values")

    # Clip outliers to the range seen during training
    features = np.clip(features, feature_min, feature_max)
    return features.tolist()
---PAGE---
---QUIZ--- TITLE: Model Deployment Knowledge Check INTRO: Let's test your understanding of model deployment and serving concepts!
Q: Which deployment pattern processes predictions at scheduled intervals rather than in real-time? A: Batch Prediction A: Real-time Prediction A: Streaming Prediction A: Shadow Deployment CORRECT: 0 EXPLAIN: Batch prediction processes large volumes of data at scheduled intervals (e.g., nightly or weekly), making it ideal for use cases that don't require immediate results like recommendation updates or periodic churn predictions.
Q: In a Blue-Green deployment strategy, what is the main advantage? A: Lower infrastructure costs A: Faster deployment time A: Zero downtime and easy rollback A: Automatic scaling CORRECT: 2 EXPLAIN: Blue-Green deployment maintains two identical environments, allowing you to switch traffic between them instantly, resulting in zero downtime. If issues arise, you can quickly rollback by switching back to the previous environment.
Q: What is the purpose of a Canary deployment? A: To test the model on synthetic data only A: To gradually roll out changes to a small percentage of users first A: To deploy multiple model versions simultaneously A: To cache prediction results CORRECT: 1 EXPLAIN: Canary deployment gradually routes a small percentage of production traffic (e.g., 5%) to the new model first. This allows you to validate the new model with real users while limiting risk. You can progressively increase traffic or rollback if issues are detected.
Q: Which of the following is NOT a key component of production-ready model serving? A: Input validation A: Health check endpoints A: Model versioning A: Training data storage CORRECT: 3 EXPLAIN: Training data storage is part of the model development phase, not model serving. Production model serving requires input validation (ensure valid data), health checks (monitor service status), and model versioning (track which model is deployed), but doesn't need access to training data.
Q: What is the primary benefit of using Docker containers for model deployment? A: Faster model training A: Better model accuracy A: Reproducible environments across different platforms A: Automatic hyperparameter tuning CORRECT: 2 EXPLAIN: Docker containers package your model, code, and all dependencies into a single unit that runs consistently across different environments (development, staging, production). This ensures reproducibility and eliminates "it works on my machine" problems.
Q: In the context of model serving, what does "inference" mean? A: Training a new model on production data A: Using a trained model to make predictions on new data A: Evaluating model performance on test data A: Updating model parameters CORRECT: 1 EXPLAIN: Inference is the process of using a trained model to make predictions on new, unseen data in production. It's distinct from training (building the model) and evaluation (testing model performance).
Q: Which metric is most important to monitor for real-time prediction services? A: Training accuracy A: Prediction latency A: Dataset size A: Number of features CORRECT: 1 EXPLAIN: For real-time prediction services, latency (response time) is critical because users expect immediate results. While accuracy is important, if your service is too slow, it won't meet real-time requirements. Training metrics and data characteristics are less relevant for production serving.
Q: What is the purpose of implementing rate limiting on a model serving endpoint? A: To improve model accuracy A: To prevent abuse and control infrastructure costs A: To speed up inference A: To cache predictions CORRECT: 1 EXPLAIN: Rate limiting restricts how many requests a client can make within a time window. This prevents abuse (intentional or accidental), protects your infrastructure from overload, and helps control costs by limiting resource usage. ---END-QUIZ---
---PAGE---
Key Takeaways
Congratulations! You've completed Part 1 of the MLOps series. Here's what you learned:
Core Concepts
✓ Model Serving: The process of making trained models available for predictions in production
✓ Serving Patterns: Batch, real-time, and streaming prediction approaches
✓ Deployment Strategies: Blue-green, canary, and shadow deployments for safe rollouts
Technical Skills
✓ REST APIs: Building prediction endpoints with Flask and FastAPI
✓ Containerization: Using Docker for reproducible deployments
✓ Frameworks: TensorFlow Serving, FastAPI, and Seldon Core
Best Practices
✓ Monitoring: Logging predictions and tracking performance metrics
✓ Validation: Checking inputs before inference
✓ Security: Authentication, rate limiting, and input sanitization
✓ Performance: Batching, caching, and async processing
Next Steps
In Part 2: Model Monitoring and Observability, you'll learn:
- How to detect model drift in production
- Monitoring prediction quality and data quality
- Setting up alerts for model degradation
- Building dashboards for ML systems
In Part 3: CI/CD for ML, you'll discover:
- Automated testing for ML models
- Building ML pipelines
- Continuous training and deployment
- Version control for models and data
In Part 4: Scaling ML Systems, you'll explore:
- Horizontal and vertical scaling strategies
- Load balancing for ML services
- Distributed inference
- Cost optimization
Resources for Further Learning
- MLOps.org - Community resources and best practices
- AWS SageMaker Documentation - Production ML on AWS
- Google Cloud AI Platform - ML deployment on GCP
- Kubernetes for ML - Container orchestration
- MLflow Documentation - ML lifecycle management
Keep practicing, and see you in Part 2!