Module 2
By the end of this module, you'll have:
- ✅ REST API serving ML predictions
- ✅ Type-safe endpoints with automatic request/response validation
- ✅ Batch processing capability for high-throughput scenarios
- ✅ Comprehensive error handling and structured logging
- ✅ Health check endpoints for load balancer integration
- ✅ Swagger UI documentation auto-generated from your code
- ✅ Container-ready service deployable to Kubernetes
BentoML simplifies ML model serving by providing:
| Without BentoML | With BentoML |
|---|---|
| Manual API boilerplate | Automatic API generation |
| Custom serialization logic | Built-in model packaging |
| Manual Docker setup | One-command containerization |
| DIY health checks | Production endpoints included |
| Complex deployment configs | Simple bentofile.yaml |
| No automatic docs | Auto-generated Swagger UI |
Key Advantage: Focus on ML logic, not infrastructure plumbing.
By the end of this module, you will:
- ✅ Package ML models as REST APIs using BentoML 1.4+ (class-based services)
- ✅ Implement input validation with Pydantic v2
- ✅ Add error handling and logging for production
- ✅ Create batch processing endpoints
- ✅ Build production-ready ML services with proper monitoring
Prerequisites:
- Completed Module 1
- Python 3.9+ installed
- Basic understanding of REST APIs
- Basic knowledge of Python classes and decorators
This module uses a scaffolded learning approach with the BentoML 1.4+ API, where you'll complete progressive exercises:
Exercise 1: Basic BentoML Service
├─ Define service class with @bentoml.service
├─ Initialize model in __init__
├─ Create prediction endpoint with @bentoml.api
└─ Use Python type hints for I/O
Exercise 2: Validation & Production Features
├─ Part 1: Pydantic Validation
└─ Part 2: Production Features
Benefits of the new API:
- ✅ Cleaner, more Pythonic class-based architecture
- ✅ Better type safety with native Python type hints
- ✅ Simpler model management (no separate save/load steps)
- ✅ Automatic OpenAPI spec generation
- ✅ Better IDE support and auto-completion
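For orientation, here is a minimal sketch of the class-based style; the service and method names are illustrative, not from the exercise files:

```python
import bentoml

# Illustration of the 1.4+ class-based API; names are made up
# for this sketch, not taken from the starter code.
@bentoml.service
class EchoService:
    @bentoml.api
    def echo(self, text: str) -> str:
        # Native type hints drive request parsing, response
        # serialization, and the generated OpenAPI spec
        return text
```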
cd modules/module-2/starter
# Install dependencies (includes BentoML 1.4+)
pip install -r ../requirements.txt
Goal: Create a basic sentiment analysis API with BentoML services
# Run the service
bentoml serve service_basic:SentimentService
# Test it (macOS/Linux/WSL)
# Note: basic service takes a plain string — BentoML wraps it under the parameter name "text"
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"text": "This is amazing!"}'
# Test it (Windows PowerShell)
$body = '{"text": "This is amazing!"}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Visit Swagger UI (macOS)
open http://localhost:3000
# Visit Swagger UI (Windows)
start http://localhost:3000
Key TODOs to Complete
TODO 1: Add @bentoml.service decorator to the class
# FILL IN: @bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
# Hint: Place the decorator directly above the class definition
TODO 2: Define __init__ method
# FILL IN: def __init__(self) -> None:
# Hint: This runs once at startup — the right place to load your model
TODO 3: Load the sentiment analysis pipeline
# FILL IN: self.pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
# Hint: Import pipeline from transformers at the top of the file
TODO 4: Add @bentoml.api decorator to the predict method
# FILL IN: @bentoml.api
# Hint: This exposes the method as an HTTP POST endpoint
TODO 5: Extract text and run prediction
# FILL IN: result = self.pipeline(text)
# Hint: self.pipeline accepts a string and returns a list of dicts
TODO 6: Return the first result from the prediction list
# FILL IN: return result[0]
# Hint: The pipeline always returns a list — grab index 0
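Putting the six fill-ins together, the completed service_basic.py looks roughly like this (a sketch assembled from the hints above; check it against solution/service_basic.py rather than treating it as canonical):

```python
# Sketch of the assembled service_basic.py (mirrors the TODO hints above)
import bentoml
from transformers import pipeline

@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
class SentimentService:
    def __init__(self) -> None:
        # Runs once at startup: load the model here, not per request
        self.pipeline = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    @bentoml.api
    def predict(self, text: str) -> dict:
        # The pipeline returns a list of dicts; return the first entry
        result = self.pipeline(text)
        return result[0]
```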
Goal: Build a production-ready service with Pydantic validation, error handling, logging, and batch processing
# Run the service
bentoml serve service_with_validation:SentimentService
# Test valid input with tracking (macOS/Linux/WSL)
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Amazing!", "request_id": "test-123"}}'
# Test valid input with tracking (Windows PowerShell)
$body = '{"request": {"text": "Amazing!", "request_id": "test-123"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Test invalid input (macOS/Linux/WSL)
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": ""}}'
# Test invalid input (Windows PowerShell)
$body = '{"request": {"text": ""}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Test batch prediction (macOS/Linux/WSL)
curl -X POST http://localhost:3000/batch_predict \
-H "Content-Type: application/json" \
-d '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'
# Test batch prediction (Windows PowerShell)
$body = '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/batch_predict -ContentType "application/json" -Body $body
# Check health (macOS/Linux/WSL)
curl http://localhost:3000/health
# Check health (Windows PowerShell)
Invoke-RestMethod -Uri http://localhost:3000/health
# Visit Swagger UI (macOS)
open http://localhost:3000
# Visit Swagger UI (Windows)
start http://localhost:3000
# Watch logs for request tracking
# Look for: [test-123] Prediction successful with latency metrics
Part 1: Pydantic Validation — Key TODOs to Complete
TODO 1: Import Pydantic
# FILL IN: from pydantic import BaseModel, Field, field_validator
TODO 2: Import standard library modules for production features
# FILL IN: from typing import List, Optional
# FILL IN: import time, logging, uuid
# FILL IN: from datetime import datetime
TODO 3: Define the SentimentRequest model
# FILL IN: class SentimentRequest(BaseModel):
# text: str = Field(..., min_length=1, max_length=5000, description="Text to analyse")
# request_id: Optional[str] = Field(None, description="Optional request ID for tracing")
TODO 4: Add a custom validator for the text field (Pydantic v2 style)
# FILL IN: @field_validator('text')
# @classmethod
# def text_must_not_be_empty_or_whitespace(cls, v: str) -> str:
# if not v or v.strip() == "":
# raise ValueError('Text cannot be empty or just whitespace')
# return v.strip()
TODO 5: Define the SentimentResponse model
# FILL IN: class SentimentResponse(BaseModel):
# text: str
# sentiment: str
# confidence: float = Field(..., ge=0.0, le=1.0)
# request_id: str
# timestamp: str
TODO 6: Define the BatchSentimentRequest model
# FILL IN: class BatchSentimentRequest(BaseModel):
# texts: List[str] = Field(..., min_length=1, max_length=100)
# request_id: Optional[str] = None
TODO 7: Define the BatchSentimentResponse model
# FILL IN: class BatchSentimentResponse(BaseModel):
# results: List[SentimentResponse]
# metadata: dict
# request_id: str
TODO 8: Define the ErrorResponse model
# FILL IN: class ErrorResponse(BaseModel):
# error: str
# message: str
# request_id: str
# timestamp: str
TODO 9: Configure logging
# FILL IN: logging.basicConfig(
# level=logging.INFO,
# format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
# datefmt='%Y-%m-%d %H:%M:%S'
# )
TODO 10: Create a logger instance
# FILL IN: logger = logging.getLogger(__name__)
TODO 11: Implement generate_request_id()
# FILL IN: def generate_request_id(provided_id: Optional[str] = None) -> str:
# if provided_id:
# return provided_id
# return str(uuid.uuid4())[:8]
TODO 12: Implement get_timestamp()
# FILL IN: def get_timestamp() -> str:
# return datetime.utcnow().isoformat()
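Before wiring these models into the service, you can sanity-check them in a Python shell. A self-contained sketch using the SentimentRequest definition from TODOs 3-4:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError, field_validator

class SentimentRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000)
    request_id: Optional[str] = None

    @field_validator("text")
    @classmethod
    def text_must_not_be_empty_or_whitespace(cls, v: str) -> str:
        if not v or v.strip() == "":
            raise ValueError("Text cannot be empty or just whitespace")
        return v.strip()

# Valid input: the custom validator also strips surrounding whitespace
print(SentimentRequest(text="  Great product!  ").text)  # -> "Great product!"

# Whitespace-only input passes min_length but fails the field_validator
try:
    SentimentRequest(text="   ")
except ValidationError as e:
    print(e.errors()[0]["msg"])
```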
Part 2: Production Features — Key TODOs to Complete
TODO 13: Add @bentoml.service decorator to the class
# FILL IN: @bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
TODO 14: Load the pipeline in __init__
# FILL IN: self.pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
TODO 15: Log that the model is ready
# FILL IN: logger.info("Model loaded and ready")
TODO 16: Add @bentoml.api decorator to predict
# FILL IN: @bentoml.api
TODO 17: Log the incoming request
# FILL IN: logger.info(f"[{request_id}] Single prediction request")
TODO 18: Wrap the prediction logic in a try/except block
# FILL IN: try:
# ...prediction logic...
# except Exception as e:
# ...error handling...
TODO 19: Run the prediction
# FILL IN: result = self.pipeline(request.text)
TODO 20: Log the successful prediction with latency
# FILL IN: logger.info(f"[{request_id}] Prediction successful", extra={"latency_ms": round(latency, 2)})
TODO 21: Return a SentimentResponse with all fields
# FILL IN: return SentimentResponse(
# text=request.text,
# sentiment=result[0]['label'],
# confidence=round(result[0]['score'], 4),
# request_id=request_id,
# timestamp=get_timestamp()
# )
TODO 22: Log the error with stack trace
# FILL IN: logger.error(f"[{request_id}] Prediction failed: {str(e)}", exc_info=True)
TODO 23: Return an error SentimentResponse
# FILL IN: return SentimentResponse(
# text=request.text, sentiment="ERROR", confidence=0.0,
# request_id=request_id, timestamp=get_timestamp()
# )
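Assembled, TODOs 16-23 form a predict method along these lines (a sketch that lives inside the service class; timing via time.time() is an assumption about how the starter file measures latency):

```python
# Sketch of the assembled predict method (TODOs 16-23)
@bentoml.api
def predict(self, request: SentimentRequest) -> SentimentResponse:
    request_id = generate_request_id(request.request_id)
    logger.info(f"[{request_id}] Single prediction request")
    start = time.time()  # assumption: latency measured around the pipeline call
    try:
        result = self.pipeline(request.text)
        latency = (time.time() - start) * 1000
        logger.info(
            f"[{request_id}] Prediction successful",
            extra={"latency_ms": round(latency, 2)},
        )
        return SentimentResponse(
            text=request.text,
            sentiment=result[0]["label"],
            confidence=round(result[0]["score"], 4),
            request_id=request_id,
            timestamp=get_timestamp(),
        )
    except Exception as e:
        # Log with stack trace, then return a structured error response
        logger.error(f"[{request_id}] Prediction failed: {str(e)}", exc_info=True)
        return SentimentResponse(
            text=request.text,
            sentiment="ERROR",
            confidence=0.0,
            request_id=request_id,
            timestamp=get_timestamp(),
        )
```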
TODO 24: Add @bentoml.api decorator to batch_predict
# FILL IN: @bentoml.api
TODO 25: Add @bentoml.api and implement the health check
# FILL IN: @bentoml.api
# def health(self) -> dict:
# return {"status": "healthy", "service": "sentiment_analysis", "timestamp": get_timestamp()}
For service_basic, BentoML wraps the str parameter under its argument name text.
# macOS/Linux/WSL — service_basic
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"text": "This workshop is amazing!"}'
# Windows PowerShell — service_basic
$body = '{"text": "This workshop is amazing!"}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
For service_with_validation, BentoML wraps the Pydantic model under the argument name request.
# macOS/Linux/WSL — service_with_validation
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "This workshop is amazing!", "request_id": "test-123"}}'
# Windows PowerShell — service_with_validation
$body = '{"request": {"text": "This workshop is amazing!", "request_id": "test-123"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
Response:
{
"text": "This workshop is amazing!",
"sentiment": "POSITIVE",
"confidence": 0.9998,
"request_id": "abc123"
}
# macOS/Linux/WSL
curl -X POST http://localhost:3000/batch_predict \
-H "Content-Type: application/json" \
-d '{"request": {"texts": ["I loved it!", "Terrible experience.", "Pretty good overall."], "request_id": "batch-456"}}'
# Windows PowerShell
$body = '{"request": {"texts": ["I loved it!", "Terrible experience.", "Pretty good overall."], "request_id": "batch-456"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/batch_predict -ContentType "application/json" -Body $body
Response:
{
"results": [
{"text": "I loved it!", "sentiment": "POSITIVE", "confidence": 0.9995, ...},
{"text": "Terrible experience.", "sentiment": "NEGATIVE", "confidence": 0.9991, ...},
{"text": "Pretty good overall.", "sentiment": "POSITIVE", "confidence": 0.8876, ...}
],
"metadata": {
"count": 3,
"latency_ms": 45.2,
"throughput_per_sec": 66.4,
"avg_latency_per_text_ms": 15.07
},
"request_id": "batch-456"
}
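For reference, a batch_predict sketch (inside the service class) that would produce metadata in the shape shown above; the timing logic and zip pairing are assumptions, since TODO 24 only asks you to add the decorator:

```python
# Sketch of a batch endpoint matching the sample response above
@bentoml.api
def batch_predict(self, request: BatchSentimentRequest) -> BatchSentimentResponse:
    request_id = generate_request_id(request.request_id)
    start = time.time()
    # The transformers pipeline accepts a list and scores every text in one call
    raw = self.pipeline(request.texts)
    elapsed_s = max(time.time() - start, 1e-6)  # guard against a zero interval
    results = [
        SentimentResponse(
            text=text,
            sentiment=r["label"],
            confidence=round(r["score"], 4),
            request_id=request_id,
            timestamp=get_timestamp(),
        )
        for text, r in zip(request.texts, raw)
    ]
    return BatchSentimentResponse(
        results=results,
        metadata={
            "count": len(request.texts),
            "latency_ms": round(elapsed_s * 1000, 2),
            "throughput_per_sec": round(len(request.texts) / elapsed_s, 2),
            "avg_latency_per_text_ms": round(elapsed_s * 1000 / len(request.texts), 2),
        },
        request_id=request_id,
    )
```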
Prediction error:
{
"text": "test input",
"sentiment": "ERROR",
"confidence": 0.0,
"request_id": "abc123"
}
Once your service is working, package it as a Bento for deployment:
# Build Bento (creates distributable package)
bentoml build
# List available Bentos
bentoml list
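The build reads bentofile.yaml from the working directory. A minimal sketch of what it might contain; the service entry point and package list are assumptions, so match them to your actual files:

```yaml
# Sketch of a minimal bentofile.yaml (entry point and packages assumed)
service: "service_with_validation:SentimentService"
include:
  - "*.py"
python:
  packages:
    - "bentoml>=1.4.0"
    - "pydantic>=2.0.0"
    - transformers
    - torch
```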
Convert your Bento to a Docker container:
# Containerize the latest Bento
bentoml containerize sentiment_service:latest -t sentiment-api:v1
# Or specify a specific version
bentoml containerize sentiment_service:abc123 -t sentiment-api:v1.0.0
# List Docker images (macOS/Linux/WSL)
docker images | grep sentiment-api
# List Docker images (Windows PowerShell)
docker images | Select-String sentiment-api
Test your containerized service locally before deploying to Kubernetes:
# Run container
docker run -p 3000:3000 sentiment-api:v1
# Test the containerized service (macOS/Linux/WSL)
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Testing containerized service!"}}'
# Test the containerized service (Windows PowerShell)
$body = '{"request": {"text": "Testing containerized service!"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Check health endpoint (macOS/Linux/WSL)
curl http://localhost:3000/health
# Check health endpoint (Windows PowerShell)
Invoke-RestMethod -Uri http://localhost:3000/health
# View container logs
docker logs <container-id>
# Stop container
docker stop <container-id>
Next: In Module 3, you'll deploy this container to Kubernetes!
- Service Classes: Define services with the @bentoml.service decorator
- Initialization: Load models in the __init__() method (runs once at startup)
- API Endpoints: Create routes with the @bentoml.api decorator on methods
- Type Hints: Use Python type hints for automatic I/O handling
- Resource Config: Set CPU/memory requirements in the decorator
- Request Models: Type-safe input validation with BaseModel
- Response Models: Structured output format
- Field Constraints: Field() with min_length, max_length, ge, le
- Custom Validators: @field_validator decorator (Pydantic v2)
- Model Config: model_config dict with json_schema_extra (see the sketch below)
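The last item is easy to miss, so here is a short sketch: json_schema_extra attaches example payloads to a model so they show up in the Swagger UI (the example text is arbitrary):

```python
from pydantic import BaseModel, Field

class SentimentRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000)

    # Pydantic v2: model_config is a plain dict; json_schema_extra
    # feeds example payloads into the generated OpenAPI spec
    model_config = {
        "json_schema_extra": {
            "examples": [{"text": "This workshop is amazing!"}]
        }
    }
```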
- Error Handling: Try/except with graceful error encoding
- Logging: Structured logs with request IDs for tracing
- Request Tracking: Unique IDs for debugging across services
- Performance Metrics: Latency and throughput monitoring
- Health Checks: /health endpoint for load balancers
- Batch Endpoints: Process multiple inputs efficiently
- Metadata: Track performance metrics per batch
- Throughput: 5-10x speedup vs individual requests
- Error Handling: Graceful degradation for batch failures
Port already in use
Symptoms:
Error: Address already in use
OSError: [Errno 48] Address already in use
Solutions:
Option 1: Use different port
bentoml serve service_with_validation:SentimentService --port 3001
Option 2: Kill existing process
# macOS/Linux
lsof -i :3000
kill -9 <PID>
# Windows PowerShell
netstat -ano | findstr :3000
# Note the PID from the last column, then:
taskkill /PID <PID> /F
Option 3: Find and stop BentoML service
# macOS/Linux: Kill all BentoML processes
pkill -f "bentoml serve"
# Or more targeted
ps aux | grep bentoml
kill <PID>
# Windows PowerShell
Get-Process | Where-Object { $_.CommandLine -like "*bentoml*" } | Stop-Process -Force  # CommandLine property requires PowerShell 7+
# Or use Task Manager to find and end the process
Missing or mismatched dependencies
Symptoms:
ModuleNotFoundError: No module named 'bentoml'
ImportError: cannot import name 'field_validator' from 'pydantic'
Solutions:
# Step 1: Activate virtual environment
source venv/bin/activate # macOS / Linux / WSL
# Windows PowerShell: venv\Scripts\Activate.ps1
# Windows CMD: venv\Scripts\activate.bat
# Step 2: Reinstall dependencies
pip install -r requirements.txt
# Step 3: Verify BentoML version (should be 1.4+)
pip show bentoml
# Version should be >= 1.4.0
# Step 4: Verify Pydantic version (should be v2)
pip show pydantic
# Version should be >= 2.0.0
# Step 5: Check Python version
python --version
# Should be >= 3.9
If issues persist:
# Clean install
pip uninstall bentoml pydantic -y
pip install --no-cache-dir "bentoml>=1.4.0" "pydantic>=2.0.0"
Model download failures
Symptoms:
HTTPError: 404 Client Error
OSError: Can't load tokenizer for 'distilbert-base-uncased'
Solutions:
Check 1: Model downloads automatically on first run
# Just start the service
bentoml serve service_basic:SentimentService
# Model downloads to cache (may take 1-2 minutes first time)
# Location: ~/.cache/huggingface/hub/
Check 2: Verify cache location
# macOS/Linux/WSL
ls ~/.cache/huggingface/hub/
# Should show model files after first run
# Windows PowerShell
dir $env:USERPROFILE\.cache\huggingface\hub\
Check 3: Manual download (if network issues)
# Pre-download model
python -c "
from transformers import pipeline
model = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
print('Model downloaded!')
"
Check 4: Clear cache if corrupted
# macOS/Linux/WSL
rm -rf ~/.cache/huggingface/hub/
# Then restart service to re-download
# Windows PowerShell
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\"
# Then restart service to re-download
Still stuck? Check the solution files
# Navigate to module
cd modules/module-2
# Install dependencies
pip install -r requirements.txt
# Serve basic service
cd starter
bentoml serve service_basic:SentimentService
# Serve with auto-reload (development)
bentoml serve service_with_validation:SentimentService --reload
# Serve on different port
bentoml serve service_with_validation:SentimentService --port 3001
# Serve with live reload (changes auto-reload)
bentoml serve service_with_validation:SentimentService --reload
# Serve with specific host
bentoml serve service_with_validation:SentimentService --host 0.0.0.0
# Combine options for development (reload + external access)
bentoml serve service_with_validation:SentimentService --reload --host 0.0.0.0 --port 3000
# View all serve options
bentoml serve --help
# macOS/Linux/WSL
# Single prediction — service_basic (str param, wrapped as {"text": "..."})
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"text": "This is amazing!"}'
# Single prediction — service_with_validation (Pydantic model, wrapped as {"request": {...}})
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "This is amazing!"}}'
# Batch prediction (Pydantic model, wrapped as {"request": {...}})
curl -X POST http://localhost:3000/batch_predict \
-H "Content-Type: application/json" \
-d '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'
# Health check
curl http://localhost:3000/health
# Get OpenAPI spec
curl http://localhost:3000/docs.json
# Visit Swagger UI (in browser)
open http://localhost:3000
# Windows PowerShell
# Single prediction — service_basic (str param, wrapped as {"text": "..."})
$body = '{"text": "This is amazing!"}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Single prediction — service_with_validation (Pydantic model, wrapped as {"request": {...}})
$body = '{"request": {"text": "This is amazing!"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Batch prediction (Pydantic model, wrapped as {"request": {...}})
$body = '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/batch_predict -ContentType "application/json" -Body $body
# Health check
Invoke-RestMethod -Uri http://localhost:3000/health
# Get OpenAPI spec
Invoke-RestMethod -Uri http://localhost:3000/docs.json
# Visit Swagger UI (in browser)
start http://localhost:3000
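If you would rather script these checks than paste curl commands, a small Python client covers the same calls (a sketch using the third-party requests library; pip install requests):

```python
import requests

BASE = "http://localhost:3000"

# Single prediction (service_with_validation payload shape)
r = requests.post(f"{BASE}/predict",
                  json={"request": {"text": "This is amazing!"}})
print(r.json())

# Batch prediction
r = requests.post(f"{BASE}/batch_predict",
                  json={"request": {"texts": ["Great!", "Terrible", "Okay"]}})
print(r.json())

# Health check (mirrors the GET curl above)
print(requests.get(f"{BASE}/health").json())
```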
# Build Bento from bentofile.yaml
bentoml build
# List all Bentos
bentoml list
# Get Bento details
bentoml get sentiment_service:latest
# Delete specific Bento
bentoml delete sentiment_service:abc123
# Delete all versions of a Bento
bentoml delete sentiment_service --yes
# Export Bento to file
bentoml export sentiment_service:latest -o sentiment_service.bento
# Import Bento from file
bentoml import sentiment_service.bento
# Build Docker image from Bento
bentoml containerize sentiment_service:latest
# Build with custom tag
bentoml containerize sentiment_service:latest -t sentiment-api:v1.0.0
# Build with custom Dockerfile template
bentoml containerize sentiment_service:latest --dockerfile-template ./custom.Dockerfile
# Push to registry
docker tag sentiment_service:latest myregistry.com/sentiment-service:v1
docker push myregistry.com/sentiment-service:v1
If you get stuck, reference implementations are available in solution/:
- service_basic.py - Exercise 1 completed
- service_with_validation.py - Exercise 2 completed
Note: Try to complete exercises on your own first! Learning happens when you struggle a bit.
After completing all exercises, try these:
- Add Caching: Implement response caching for repeated requests
- Async Endpoints: Convert to async/await for better concurrency (see the sketch after this list)
- Metrics Endpoint: Add a /metrics endpoint for Prometheus
- Custom Models: Replace with a different HuggingFace model
- Multiple Endpoints: Add sentiment + topic classification
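For the async challenge, @bentoml.api also accepts async methods. One possible approach, sketched as a method inside the service class (the name predict_async and the thread offload are illustrative choices, not from the starter code):

```python
import asyncio

@bentoml.api
async def predict_async(self, text: str) -> dict:
    # Run the blocking transformers call in a worker thread so the
    # event loop stays free to accept concurrent requests
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, self.pipeline, text)
    return result[0]
```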
Once you've completed all exercises and tests pass:
→ Module 3: Kubernetes Deployment
In Module 3, you'll deploy this BentoML service to Kubernetes!
Having issues? Check the Troubleshooting section or review the solution files!
| Previous | Home | Next |
|---|---|---|
| ← Module 1: Model Training & Experiment Tracking | 🏠 Home | Module 3: Kubernetes Deployment → |
MLOps Workshop | GitHub Repository