Module 2
By the end of this module, you'll have:
- ✅ REST API serving ML predictions
- ✅ Type-safe endpoints with automatic request/response validation
- ✅ Batch processing capability for high-throughput scenarios
- ✅ Comprehensive error handling and structured logging
- ✅ Health check endpoints for load balancer integration
- ✅ Swagger UI documentation auto-generated from your code
- ✅ Container-ready service deployable to Kubernetes
BentoML simplifies ML model serving by providing:
| Without BentoML | With BentoML |
|---|---|
| Manual API boilerplate | Automatic API generation |
| Custom serialization logic | Built-in model packaging |
| Manual Docker setup | One-command containerization |
| DIY health checks | Production endpoints included |
| Complex deployment configs | Simple bentofile.yaml |
| No automatic docs | Auto-generated Swagger UI |
Key Advantage: Focus on ML logic, not infrastructure plumbing.
By the end of this module, you will:
- ✅ Package ML models as REST APIs using BentoML 1.4+ (class-based services)
- ✅ Implement input validation with Pydantic v2
- ✅ Add error handling and logging for production
- ✅ Create batch processing endpoints
- ✅ Build production-ready ML services with proper monitoring
Prerequisites:
- Completed Module 1
- Python 3.9+ installed
- Basic understanding of REST APIs
- Basic knowledge of Python classes and decorators
This module uses a scaffolded learning approach with the BentoML 1.4+ API, where you'll complete progressive exercises:
Exercise 1: Basic BentoML Service
├─ Define service class with @bentoml.service
├─ Initialize model in __init__
├─ Create prediction endpoint with @bentoml.api
└─ Use Python type hints for I/O
Exercise 2: Validation & Production Features
├─ Part 1: Pydantic Validation
└─ Part 2: Production Features
Benefits of the new API:
- ✅ Cleaner, more Pythonic class-based architecture
- ✅ Better type safety with native Python type hints
- ✅ Simpler model management (no separate save/load steps)
- ✅ Automatic OpenAPI spec generation
- ✅ Better IDE support and auto-completion
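For orientation, here is a minimal sketch of the class-based style; the service and method names are illustrative, not from the exercise files:

```python
import bentoml

# Illustration of the 1.4+ class-based API; names are made up
# for this sketch, not taken from the starter code.
@bentoml.service
class EchoService:
    @bentoml.api
    def echo(self, text: str) -> str:
        # Native type hints drive request parsing, response
        # serialization, and the generated OpenAPI spec
        return text
```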
cd modules/module-2/starter
# Install dependencies (includes BentoML 1.4+)
pip install -r ../requirements.txt
Goal: Create a basic sentiment analysis API with BentoML services
# Run the service
bentoml serve service_basic:SentimentService
# Test it (macOS/Linux/WSL)
# Note: basic service takes a plain string — BentoML wraps it under the parameter name "text"
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"text": "This is amazing!"}'
# Test it (Windows PowerShell)
$body = '{"text": "This is amazing!"}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Visit Swagger UI (macOS)
open http://localhost:3000
# Visit Swagger UI (Windows)
start http://localhost:3000
Key TODOs to Complete
TODO 1: Add @bentoml.service decorator to the class
# FILL IN: @bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
# Hint: Place the decorator directly above the class definition
TODO 2: Define __init__ method
# FILL IN: def __init__(self) -> None:
# Hint: This runs once at startup — the right place to load your model
TODO 3: Load the sentiment analysis pipeline
# FILL IN: self.pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
# Hint: Import pipeline from transformers at the top of the file
TODO 4: Add @bentoml.api decorator to the predict method
# FILL IN: @bentoml.api
# Hint: This exposes the method as an HTTP POST endpoint
TODO 5: Extract text and run prediction
# FILL IN: result = self.pipeline(text)
# Hint: self.pipeline accepts a string and returns a list of dicts
TODO 6: Return the first result from the prediction list
# FILL IN: return result[0]
# Hint: The pipeline always returns a list — grab index 0
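Putting the six fill-ins together, the completed service_basic.py looks roughly like this (a sketch assembled from the hints above; check it against solution/service_basic.py rather than treating it as canonical):

```python
# Sketch of the assembled service_basic.py (mirrors the TODO hints above)
import bentoml
from transformers import pipeline

@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
class SentimentService:
    def __init__(self) -> None:
        # Runs once at startup: load the model here, not per request
        self.pipeline = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    @bentoml.api
    def predict(self, text: str) -> dict:
        # The pipeline returns a list of dicts; return the first entry
        result = self.pipeline(text)
        return result[0]
```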
Goal: Build a production-ready service with Pydantic validation, error handling, logging, and batch processing
# Run the service
bentoml serve service_with_validation:SentimentService
# Test valid input with tracking (macOS/Linux/WSL)
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Amazing!", "request_id": "test-123"}}'
# Test valid input with tracking (Windows PowerShell)
$body = '{"request": {"text": "Amazing!", "request_id": "test-123"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Test invalid input (macOS/Linux/WSL)
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": ""}}'
# Test invalid input (Windows PowerShell)
$body = '{"request": {"text": ""}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Test batch prediction (macOS/Linux/WSL)
curl -X POST http://localhost:3000/batch_predict \
-H "Content-Type: application/json" \
-d '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'
# Test batch prediction (Windows PowerShell)
$body = '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/batch_predict -ContentType "application/json" -Body $body
# Check health (macOS/Linux/WSL)
curl http://localhost:3000/health
# Check health (Windows PowerShell)
Invoke-RestMethod -Uri http://localhost:3000/health
# Visit Swagger UI (macOS)
open http://localhost:3000
# Visit Swagger UI (Windows)
start http://localhost:3000
# Watch logs for request tracking
# Look for: [test-123] Prediction successful with latency metrics
Part 1: Pydantic Validation — Key TODOs to Complete
TODO 1: Import Pydantic
# FILL IN: from pydantic import BaseModel, Field, field_validator
TODO 2: Import standard library modules for production features
# FILL IN: from typing import List, Optional
# FILL IN: import time, logging, uuid
# FILL IN: from datetime import datetime
TODO 3: Define the SentimentRequest model
# FILL IN: class SentimentRequest(BaseModel):
# text: str = Field(..., min_length=1, max_length=5000, description="Text to analyse")
# request_id: Optional[str] = Field(None, description="Optional request ID for tracing")
TODO 4: Add a custom validator for the text field (Pydantic v2 style)
# FILL IN: @field_validator('text')
# @classmethod
# def text_must_not_be_empty_or_whitespace(cls, v: str) -> str:
# if not v or v.strip() == "":
# raise ValueError('Text cannot be empty or just whitespace')
# return v.strip()
TODO 5: Define the SentimentResponse model
# FILL IN: class SentimentResponse(BaseModel):
# text: str
# sentiment: str
# confidence: float = Field(..., ge=0.0, le=1.0)
# request_id: str
# timestamp: str
TODO 6: Define the BatchSentimentRequest model
# FILL IN: class BatchSentimentRequest(BaseModel):
# texts: List[str] = Field(..., min_length=1, max_length=100)
# request_id: Optional[str] = None
TODO 7: Define the BatchSentimentResponse model
# FILL IN: class BatchSentimentResponse(BaseModel):
# results: List[SentimentResponse]
# metadata: dict
# request_id: str
TODO 8: Define the ErrorResponse model
# FILL IN: class ErrorResponse(BaseModel):
# error: str
# message: str
# request_id: str
# timestamp: str
TODO 9: Configure logging
# FILL IN: logging.basicConfig(
# level=logging.INFO,
# format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
# datefmt='%Y-%m-%d %H:%M:%S'
# )
TODO 10: Create a logger instance
# FILL IN: logger = logging.getLogger(__name__)
TODO 11: Implement generate_request_id()
# FILL IN: def generate_request_id(provided_id: Optional[str] = None) -> str:
# if provided_id:
# return provided_id
# return str(uuid.uuid4())[:8]
TODO 12: Implement get_timestamp()
# FILL IN: def get_timestamp() -> str:
# return datetime.utcnow().isoformat()
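Before wiring these models into the service, you can sanity-check them in a Python shell. A self-contained sketch using the SentimentRequest definition from TODOs 3-4:

```python
from typing import Optional
from pydantic import BaseModel, Field, ValidationError, field_validator

class SentimentRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000)
    request_id: Optional[str] = None

    @field_validator("text")
    @classmethod
    def text_must_not_be_empty_or_whitespace(cls, v: str) -> str:
        if not v or v.strip() == "":
            raise ValueError("Text cannot be empty or just whitespace")
        return v.strip()

# Valid input: the custom validator also strips surrounding whitespace
print(SentimentRequest(text="  Great product!  ").text)  # -> "Great product!"

# Whitespace-only input passes min_length but fails the field_validator
try:
    SentimentRequest(text="   ")
except ValidationError as e:
    print(e.errors()[0]["msg"])
```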
Part 2: Production Features — Key TODOs to Complete
TODO 13: Add @bentoml.service decorator to the class
# FILL IN: @bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
TODO 14: Load the pipeline in __init__
# FILL IN: self.pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
TODO 15: Log that the model is ready
# FILL IN: logger.info("Model loaded and ready")
TODO 16: Add @bentoml.api decorator to predict
# FILL IN: @bentoml.api
TODO 17: Log the incoming request
# FILL IN: logger.info(f"[{request_id}] Single prediction request")
TODO 18: Wrap the prediction logic in a try/except block
# FILL IN: try:
# ...prediction logic...
# except Exception as e:
# ...error handling...
TODO 19: Run the prediction
# FILL IN: result = self.pipeline(request.text)
TODO 20: Log the successful prediction with latency
# FILL IN: logger.info(f"[{request_id}] Prediction successful", extra={"latency_ms": round(latency, 2)})
TODO 21: Return a SentimentResponse with all fields
# FILL IN: return SentimentResponse(
# text=request.text,
# sentiment=result[0]['label'],
# confidence=round(result[0]['score'], 4),
# request_id=request_id,
# timestamp=get_timestamp()
# )
TODO 22: Log the error with stack trace
# FILL IN: logger.error(f"[{request_id}] Prediction failed: {str(e)}", exc_info=True)
TODO 23: Return an error SentimentResponse
# FILL IN: return SentimentResponse(
# text=request.text, sentiment="ERROR", confidence=0.0,
# request_id=request_id, timestamp=get_timestamp()
# )
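Assembled, TODOs 16-23 form a predict method along these lines (a sketch that lives inside the service class; timing via time.time() is an assumption about how the starter file measures latency):

```python
# Sketch of the assembled predict method (TODOs 16-23)
@bentoml.api
def predict(self, request: SentimentRequest) -> SentimentResponse:
    request_id = generate_request_id(request.request_id)
    logger.info(f"[{request_id}] Single prediction request")
    start = time.time()  # assumption: latency measured around the pipeline call
    try:
        result = self.pipeline(request.text)
        latency = (time.time() - start) * 1000
        logger.info(
            f"[{request_id}] Prediction successful",
            extra={"latency_ms": round(latency, 2)},
        )
        return SentimentResponse(
            text=request.text,
            sentiment=result[0]["label"],
            confidence=round(result[0]["score"], 4),
            request_id=request_id,
            timestamp=get_timestamp(),
        )
    except Exception as e:
        # Log with stack trace, then return a structured error response
        logger.error(f"[{request_id}] Prediction failed: {str(e)}", exc_info=True)
        return SentimentResponse(
            text=request.text,
            sentiment="ERROR",
            confidence=0.0,
            request_id=request_id,
            timestamp=get_timestamp(),
        )
```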
TODO 24: Add @bentoml.api decorator to batch_predict
# FILL IN: @bentoml.api
TODO 25: Add @bentoml.api and implement the health check
# FILL IN: @bentoml.api
# def health(self) -> dict:
# return {"status": "healthy", "service": "sentiment_analysis", "timestamp": get_timestamp()}
For service_basic, BentoML wraps the str parameter under its argument name text.
# macOS/Linux/WSL — service_basic
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"text": "This workshop is amazing!"}'
# Windows PowerShell — service_basic
$body = '{"text": "This workshop is amazing!"}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
For service_with_validation, BentoML wraps the Pydantic model under the argument name request.
# macOS/Linux/WSL — service_with_validation
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "This workshop is amazing!", "request_id": "test-123"}}'
# Windows PowerShell — service_with_validation
$body = '{"request": {"text": "This workshop is amazing!", "request_id": "test-123"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
Response:
{
"text": "This workshop is amazing!",
"sentiment": "POSITIVE",
"confidence": 0.9998,
"request_id": "abc123"
}
# macOS/Linux/WSL
curl -X POST http://localhost:3000/batch_predict \
-H "Content-Type: application/json" \
-d '{"request": {"texts": ["I loved it!", "Terrible experience.", "Pretty good overall."], "request_id": "batch-456"}}'
# Windows PowerShell
$body = '{"request": {"texts": ["I loved it!", "Terrible experience.", "Pretty good overall."], "request_id": "batch-456"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/batch_predict -ContentType "application/json" -Body $body
Response:
{
"results": [
{"text": "I loved it!", "sentiment": "POSITIVE", "confidence": 0.9995, ...},
{"text": "Terrible experience.", "sentiment": "NEGATIVE", "confidence": 0.9991, ...},
{"text": "Pretty good overall.", "sentiment": "POSITIVE", "confidence": 0.8876, ...}
],
"metadata": {
"count": 3,
"latency_ms": 45.2,
"throughput_per_sec": 66.4,
"avg_latency_per_text_ms": 15.07
},
"request_id": "batch-456"
}
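For reference, a batch_predict sketch (inside the service class) that would produce metadata in the shape shown above; the timing logic and zip pairing are assumptions, since TODO 24 only asks you to add the decorator:

```python
# Sketch of a batch endpoint matching the sample response above
@bentoml.api
def batch_predict(self, request: BatchSentimentRequest) -> BatchSentimentResponse:
    request_id = generate_request_id(request.request_id)
    start = time.time()
    # The transformers pipeline accepts a list and scores every text in one call
    raw = self.pipeline(request.texts)
    elapsed_s = max(time.time() - start, 1e-6)  # guard against a zero interval
    results = [
        SentimentResponse(
            text=text,
            sentiment=r["label"],
            confidence=round(r["score"], 4),
            request_id=request_id,
            timestamp=get_timestamp(),
        )
        for text, r in zip(request.texts, raw)
    ]
    return BatchSentimentResponse(
        results=results,
        metadata={
            "count": len(request.texts),
            "latency_ms": round(elapsed_s * 1000, 2),
            "throughput_per_sec": round(len(request.texts) / elapsed_s, 2),
            "avg_latency_per_text_ms": round(elapsed_s * 1000 / len(request.texts), 2),
        },
        request_id=request_id,
    )
```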
Prediction error:
{
"text": "test input",
"sentiment": "ERROR",
"confidence": 0.0,
"request_id": "abc123"
}
Once your service is working, package it as a Bento for deployment:
# Build Bento (creates distributable package)
bentoml build
# List available Bentos
bentoml list
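The build reads bentofile.yaml from the working directory. A minimal sketch of what it might contain; the service entry point and package list are assumptions, so match them to your actual files:

```yaml
# Sketch of a minimal bentofile.yaml (entry point and packages assumed)
service: "service_with_validation:SentimentService"
include:
  - "*.py"
python:
  packages:
    - "bentoml>=1.4.0"
    - "pydantic>=2.0.0"
    - transformers
    - torch
```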
Convert your Bento to a Docker container:
# Containerize the latest Bento
bentoml containerize sentiment_service:latest -t sentiment-api:v1
# Or specify a specific version
bentoml containerize sentiment_service:abc123 -t sentiment-api:v1.0.0
# List Docker images (macOS/Linux/WSL)
docker images | grep sentiment-api
# List Docker images (Windows PowerShell)
docker images | Select-String sentiment-api
Test your containerized service locally before deploying to Kubernetes:
# Run container
docker run -p 3000:3000 sentiment-api:v1
# Test the containerized service (macOS/Linux/WSL)
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "Testing containerized service!"}}'
# Test the containerized service (Windows PowerShell)
$body = '{"request": {"text": "Testing containerized service!"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Check health endpoint (macOS/Linux/WSL)
curl http://localhost:3000/health
# Check health endpoint (Windows PowerShell)
Invoke-RestMethod -Uri http://localhost:3000/health
# View container logs
docker logs <container-id>
# Stop container
docker stop <container-id>
Next: In Module 3, you'll deploy this container to Kubernetes!
- Service Classes: Define services with the @bentoml.service decorator
- Initialization: Load models in the __init__() method (runs once at startup)
- API Endpoints: Create routes with the @bentoml.api decorator on methods
- Type Hints: Use Python type hints for automatic I/O handling
- Resource Config: Set CPU/memory requirements in the decorator
- Request Models: Type-safe input validation with BaseModel
- Response Models: Structured output format
- Field Constraints: Field() with min_length, max_length, ge, le
- Custom Validators: @field_validator decorator (Pydantic v2)
- Model Config: model_config dict with json_schema_extra (see the sketch below)
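The last item is easy to miss, so here is a short sketch: json_schema_extra attaches example payloads to a model so they show up in the Swagger UI (the example text is arbitrary):

```python
from pydantic import BaseModel, Field

class SentimentRequest(BaseModel):
    text: str = Field(..., min_length=1, max_length=5000)

    # Pydantic v2: model_config is a plain dict; json_schema_extra
    # feeds example payloads into the generated OpenAPI spec
    model_config = {
        "json_schema_extra": {
            "examples": [{"text": "This workshop is amazing!"}]
        }
    }
```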
- Error Handling: Try/except with graceful error encoding
- Logging: Structured logs with request IDs for tracing
- Request Tracking: Unique IDs for debugging across services
- Performance Metrics: Latency and throughput monitoring
- Health Checks: /health endpoint for load balancers
- Batch Endpoints: Process multiple inputs efficiently
- Metadata: Track performance metrics per batch
- Throughput: 5-10x speedup vs individual requests
- Error Handling: Graceful degradation for batch failures
Port already in use
Symptoms:
Error: Address already in use
OSError: [Errno 48] Address already in use
Solutions:
Option 1: Use different port
bentoml serve service_with_validation:SentimentService --port 3001
Option 2: Kill existing process
# macOS/Linux
lsof -i :3000
kill -9 <PID>
# Windows PowerShell
netstat -ano | findstr :3000
# Note the PID from the last column, then:
taskkill /PID <PID> /F
Option 3: Find and stop BentoML service
# macOS/Linux: Kill all BentoML processes
pkill -f "bentoml serve"
# Or more targeted
ps aux | grep bentoml
kill <PID>
# Windows PowerShell
Get-Process | Where-Object { $_.CommandLine -like "*bentoml*" } | Stop-Process -Force  # CommandLine property requires PowerShell 7+
# Or use Task Manager to find and end the process
Missing or mismatched dependencies
Symptoms:
ModuleNotFoundError: No module named 'bentoml'
ImportError: cannot import name 'field_validator' from 'pydantic'
Solutions:
# Step 1: Activate virtual environment
source venv/bin/activate # macOS / Linux / WSL
# Windows PowerShell: venv\Scripts\Activate.ps1
# Windows CMD: venv\Scripts\activate.bat
# Step 2: Reinstall dependencies
pip install -r requirements.txt
# Step 3: Verify BentoML version (should be 1.4+)
pip show bentoml
# Version should be >= 1.4.0
# Step 4: Verify Pydantic version (should be v2)
pip show pydantic
# Version should be >= 2.0.0
# Step 5: Check Python version
python --version
# Should be >= 3.9
If issues persist:
# Clean install
pip uninstall bentoml pydantic -y
pip install --no-cache-dir "bentoml>=1.4.0" "pydantic>=2.0.0"
Model download failures
Symptoms:
HTTPError: 404 Client Error
OSError: Can't load tokenizer for 'distilbert-base-uncased'
Solutions:
Check 1: Model downloads automatically on first run
# Just start the service
bentoml serve service_basic:SentimentService
# Model downloads to cache (may take 1-2 minutes first time)
# Location: ~/.cache/huggingface/hub/
Check 2: Verify cache location
# macOS/Linux/WSL
ls ~/.cache/huggingface/hub/
# Should show model files after first run
# Windows PowerShell
dir $env:USERPROFILE\.cache\huggingface\hub\
Check 3: Manual download (if network issues)
# Pre-download model
python -c "
from transformers import pipeline
model = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
print('Model downloaded!')
"
Check 4: Clear cache if corrupted
# macOS/Linux/WSL
rm -rf ~/.cache/huggingface/hub/
# Then restart service to re-download
# Windows PowerShell
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\"
# Then restart service to re-download
Still stuck? Check the solution files
# Navigate to module
cd modules/module-2
# Install dependencies
pip install -r requirements.txt
# Serve basic service
cd starter
bentoml serve service_basic:SentimentService
# Serve with auto-reload (development)
bentoml serve service_with_validation:SentimentService --reload
# Serve on different port
bentoml serve service_with_validation:SentimentService --port 3001
# Serve with live reload (changes auto-reload)
bentoml serve service_with_validation:SentimentService --reload
# Serve with specific host
bentoml serve service_with_validation:SentimentService --host 0.0.0.0
# Combine options for development (reload + external access)
bentoml serve service_with_validation:SentimentService --reload --host 0.0.0.0 --port 3000
# View all serve options
bentoml serve --help
# macOS/Linux/WSL
# Single prediction — service_basic (str param, wrapped as {"text": "..."})
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"text": "This is amazing!"}'
# Single prediction — service_with_validation (Pydantic model, wrapped as {"request": {...}})
curl -X POST http://localhost:3000/predict \
-H "Content-Type: application/json" \
-d '{"request": {"text": "This is amazing!"}}'
# Batch prediction (Pydantic model, wrapped as {"request": {...}})
curl -X POST http://localhost:3000/batch_predict \
-H "Content-Type: application/json" \
-d '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'
# Health check
curl http://localhost:3000/health
# Get OpenAPI spec
curl http://localhost:3000/docs.json
# Visit Swagger UI (in browser)
open http://localhost:3000
# Windows PowerShell
# Single prediction — service_basic (str param, wrapped as {"text": "..."})
$body = '{"text": "This is amazing!"}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Single prediction — service_with_validation (Pydantic model, wrapped as {"request": {...}})
$body = '{"request": {"text": "This is amazing!"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Batch prediction (Pydantic model, wrapped as {"request": {...}})
$body = '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/batch_predict -ContentType "application/json" -Body $body
# Health check
Invoke-RestMethod -Uri http://localhost:3000/health
# Get OpenAPI spec
Invoke-RestMethod -Uri http://localhost:3000/docs.json
# Visit Swagger UI (in browser)
start http://localhost:3000
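If you would rather script these checks than paste curl commands, a small Python client covers the same calls (a sketch using the third-party requests library; pip install requests):

```python
import requests

BASE = "http://localhost:3000"

# Single prediction (service_with_validation payload shape)
r = requests.post(f"{BASE}/predict",
                  json={"request": {"text": "This is amazing!"}})
print(r.json())

# Batch prediction
r = requests.post(f"{BASE}/batch_predict",
                  json={"request": {"texts": ["Great!", "Terrible", "Okay"]}})
print(r.json())

# Health check (mirrors the GET curl above)
print(requests.get(f"{BASE}/health").json())
```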
# Build Bento from bentofile.yaml
bentoml build
# List all Bentos
bentoml list
# Get Bento details
bentoml get sentiment_service:latest
# Delete specific Bento
bentoml delete sentiment_service:abc123
# Delete all versions of a Bento
bentoml delete sentiment_service --yes
# Export Bento to file
bentoml export sentiment_service:latest -o sentiment_service.bento
# Import Bento from file
bentoml import sentiment_service.bento
# Build Docker image from Bento
bentoml containerize sentiment_service:latest
# Build with custom tag
bentoml containerize sentiment_service:latest -t sentiment-api:v1.0.0
# Build with custom Dockerfile template
bentoml containerize sentiment_service:latest --dockerfile-template ./custom.Dockerfile
# Push to registry
docker tag sentiment_service:latest myregistry.com/sentiment-service:v1
docker push myregistry.com/sentiment-service:v1
If you get stuck, reference implementations are available in solution/:
- service_basic.py - Exercise 1 completed
- service_with_validation.py - Exercise 2 completed
Note: Try to complete exercises on your own first! Learning happens when you struggle a bit.
After completing all exercises, try these:
- Add Caching: Implement response caching for repeated requests
- Async Endpoints: Convert to async/await for better concurrency (see the sketch after this list)
- Metrics Endpoint: Add a /metrics endpoint for Prometheus
- Custom Models: Replace with a different HuggingFace model
- Multiple Endpoints: Add sentiment + topic classification
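For the async challenge, @bentoml.api also accepts async methods. One possible approach, sketched as a method inside the service class (the name predict_async and the thread offload are illustrative choices, not from the starter code):

```python
import asyncio

@bentoml.api
async def predict_async(self, text: str) -> dict:
    # Run the blocking transformers call in a worker thread so the
    # event loop stays free to accept concurrent requests
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(None, self.pipeline, text)
    return result[0]
```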
Once you've completed all exercises and tests pass:
→ Module 3: Kubernetes Deployment
In Module 3, you'll deploy this BentoML service to Kubernetes!
Having issues? Check the Troubleshooting section or review the solution files!
| Previous | Home | Next |
|---|---|---|
| ← Module 1: Model Training & Experiment Tracking | 🏠 Home | Module 3: Kubernetes Deployment → |
MLOps Workshop | GitHub Repository