Module 2: Model Packaging with BentoML 1.4+

What You'll Build

By the end of this module, you'll have:

  • ✅ REST API serving ML predictions
  • ✅ Type-safe endpoints with automatic request/response validation
  • ✅ Batch processing capability for high-throughput scenarios
  • ✅ Comprehensive error handling and structured logging
  • ✅ Health check endpoints for load balancer integration
  • ✅ Swagger UI documentation auto-generated from your code
  • ✅ Container-ready service deployable to Kubernetes

What You'll Learn

Why BentoML?

BentoML simplifies ML model serving by providing:

| Without BentoML | With BentoML |
| --- | --- |
| Manual API boilerplate | Automatic API generation |
| Custom serialization logic | Built-in model packaging |
| Manual Docker setup | One-command containerization |
| DIY health checks | Production endpoints included |
| Complex deployment configs | Simple bentofile.yaml |
| No automatic docs | Auto-generated Swagger UI |

Key Advantage: Focus on ML logic, not infrastructure plumbing.

Learning Objectives

By the end of this module, you will:

  • ✅ Package ML models as REST APIs using BentoML 1.4+ (class-based services)
  • ✅ Implement input validation with Pydantic v2
  • ✅ Add error handling and logging for production
  • ✅ Create batch processing endpoints
  • ✅ Build production-ready ML services with proper monitoring

Part 1: Setup & Prerequisites

Prerequisites

  • Completed Module 1
  • Python 3.9+ installed
  • Basic understanding of REST APIs
  • Basic knowledge of Python classes and decorators

Workshop Format

This module uses a scaffolded learning approach with the BentoML 1.4+ API, where you'll complete two progressive exercises (the second in two parts):

Exercise 1: Basic BentoML Service
├─ Define service class with @bentoml.service
├─ Initialize model in __init__
├─ Create prediction endpoint with @bentoml.api
└─ Use Python type hints for I/O

Exercise 2: Validation & Production Features
├─ Part 1: Pydantic Validation
└─ Part 2: Production Features

Benefits of the new API:

  • ✅ Cleaner, more Pythonic class-based architecture
  • ✅ Better type safety with native Python type hints
  • ✅ Simpler model management (no separate save/load steps)
  • ✅ Automatic OpenAPI spec generation
  • ✅ Better IDE support and auto-completion
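
For orientation, here's the shape of a class-based service in the new API, as a minimal sketch (EchoService is an illustrative name, not part of the exercises):

import bentoml

@bentoml.service(resources={"cpu": "1"})
class EchoService:
    """Minimal BentoML 1.4+ class-based service."""

    @bentoml.api
    def echo(self, text: str) -> dict:
        # Type hints drive request parsing and the generated OpenAPI spec
        return {"echo": text}

You serve it with bentoml serve <module>:EchoService, exactly like the exercise services below.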

Part 2: Hands-On Exercises

Quick Start

1. Setup

cd modules/module-2/starter

# Install dependencies (includes BentoML 1.4+)
pip install -r ../requirements.txt

2. Complete Exercises

Exercise 1: Basic Service

Goal: Create a basic sentiment analysis API with BentoML services

# Run the service
bentoml serve service_basic:SentimentService

# Test it (macOS/Linux/WSL)
# Note: basic service takes a plain string — BentoML wraps it under the parameter name "text"
curl -X POST http://localhost:3000/predict \
     -H "Content-Type: application/json" \
     -d '{"text": "This is amazing!"}'

# Test it (Windows PowerShell)
$body = '{"text": "This is amazing!"}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body

# Visit Swagger UI (macOS)
open http://localhost:3000
# Visit Swagger UI (Windows)
start http://localhost:3000

Key TODOs to Complete

TODO 1: Add @bentoml.service decorator to the class

# FILL IN: @bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
# Hint: Place the decorator directly above the class definition

TODO 2: Define __init__ method

# FILL IN: def __init__(self) -> None:
# Hint: This runs once at startup — the right place to load your model

TODO 3: Load the sentiment analysis pipeline

# FILL IN: self.pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")
# Hint: Import pipeline from transformers at the top of the file

TODO 4: Add @bentoml.api decorator to the predict method

# FILL IN: @bentoml.api
# Hint: This exposes the method as an HTTP POST endpoint

TODO 5: Extract text and run prediction

# FILL IN: result = self.pipeline(text)
# Hint: self.pipeline accepts a string and returns a list of dicts

TODO 6: Return the first result from the prediction list

# FILL IN: return result[0]
# Hint: The pipeline always returns a list — grab index 0
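
With all six TODOs filled in, the assembled service looks roughly like this (a sketch built from the fill-ins above; solution/service_basic.py is the reference version):

import bentoml
from transformers import pipeline

@bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})
class SentimentService:
    def __init__(self) -> None:
        # Runs once at startup: load the HuggingFace sentiment pipeline
        self.pipeline = pipeline(
            "sentiment-analysis",
            model="distilbert-base-uncased-finetuned-sst-2-english",
        )

    @bentoml.api
    def predict(self, text: str) -> dict:
        # The pipeline returns a list of dicts; return the first entry
        result = self.pipeline(text)
        return result[0]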

Exercise 2: Input Validation & Production Features

Goal: Build a production-ready service with Pydantic validation, error handling, logging, and batch processing

# Run the service
bentoml serve service_with_validation:SentimentService

# Test valid input with tracking (macOS/Linux/WSL)
curl -X POST http://localhost:3000/predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"text": "Amazing!", "request_id": "test-123"}}'

# Test valid input with tracking (Windows PowerShell)
$body = '{"request": {"text": "Amazing!", "request_id": "test-123"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body

# Test invalid input (macOS/Linux/WSL)
curl -X POST http://localhost:3000/predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"text": ""}}'

# Test invalid input (Windows PowerShell)
$body = '{"request": {"text": ""}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body

# Test batch prediction (macOS/Linux/WSL)
curl -X POST http://localhost:3000/batch_predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'

# Test batch prediction (Windows PowerShell)
$body = '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/batch_predict -ContentType "application/json" -Body $body

# Check health (macOS/Linux/WSL)
curl http://localhost:3000/health

# Check health (Windows PowerShell)
Invoke-RestMethod -Uri http://localhost:3000/health

# Visit Swagger UI (macOS)
open http://localhost:3000
# Visit Swagger UI (Windows)
start http://localhost:3000

# Watch logs for request tracking
# Look for: [test-123] Prediction successful with latency metrics

Part 1: Pydantic Validation — Key TODOs to Complete

TODO 1: Import Pydantic

# FILL IN: from pydantic import BaseModel, Field, field_validator

TODO 2: Import standard library modules for production features

# FILL IN: from typing import List, Optional
# FILL IN: import time, logging, uuid
# FILL IN: from datetime import datetime

TODO 3: Define the SentimentRequest model

# FILL IN: class SentimentRequest(BaseModel):
#     text: str = Field(..., min_length=1, max_length=5000, description="Text to analyse")
#     request_id: Optional[str] = Field(None, description="Optional request ID for tracing")

TODO 4: Add a custom validator for the text field (Pydantic v2 style)

# FILL IN: @field_validator('text')
# @classmethod
# def text_must_not_be_empty_or_whitespace(cls, v: str) -> str:
#     if not v or v.strip() == "":
#         raise ValueError('Text cannot be empty or just whitespace')
#     return v.strip()

TODO 5: Define the SentimentResponse model

# FILL IN: class SentimentResponse(BaseModel):
#     text: str
#     sentiment: str
#     confidence: float = Field(..., ge=0.0, le=1.0)
#     request_id: str
#     timestamp: str

TODO 6: Define the BatchSentimentRequest model

# FILL IN: class BatchSentimentRequest(BaseModel):
#     texts: List[str] = Field(..., min_length=1, max_length=100)
#     request_id: Optional[str] = None

TODO 7: Define the BatchSentimentResponse model

# FILL IN: class BatchSentimentResponse(BaseModel):
#     results: List[SentimentResponse]
#     metadata: dict
#     request_id: str

TODO 8: Define the ErrorResponse model

# FILL IN: class ErrorResponse(BaseModel):
#     error: str
#     message: str
#     request_id: str
#     timestamp: str

TODO 9: Configure logging

# FILL IN: logging.basicConfig(
#     level=logging.INFO,
#     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
#     datefmt='%Y-%m-%d %H:%M:%S'
# )

TODO 10: Create a logger instance

# FILL IN: logger = logging.getLogger(__name__)

TODO 11: Implement generate_request_id()

# FILL IN: def generate_request_id(provided_id: Optional[str] = None) -> str:
#     if provided_id:
#         return provided_id
#     return str(uuid.uuid4())[:8]

TODO 12: Implement get_timestamp()

# FILL IN: def get_timestamp() -> str:
#     return datetime.utcnow().isoformat()
# Note: datetime.utcnow() is deprecated since Python 3.12; if you prefer,
# use datetime.now(timezone.utc) and add timezone to the datetime import
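
To sanity-check the validation without starting the service, you can exercise the request model directly (a quick illustration, assuming the models above are defined in your module):

from pydantic import ValidationError

# Whitespace is stripped by the custom validator
req = SentimentRequest(text="  Great workshop!  ")
print(req.text)  # "Great workshop!"

# Whitespace-only text passes min_length but fails the custom validator
try:
    SentimentRequest(text="   ")
except ValidationError as e:
    print(e.errors()[0]["msg"])  # Value error, Text cannot be empty or just whitespace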

Part 2: Production Features — Key TODOs to Complete

TODO 13: Add @bentoml.service decorator to the class

# FILL IN: @bentoml.service(resources={"cpu": "2"}, traffic={"timeout": 30})

TODO 14: Load the pipeline in __init__

# FILL IN: self.pipeline = pipeline("sentiment-analysis", model="distilbert-base-uncased-finetuned-sst-2-english")

TODO 15: Log that the model is ready

# FILL IN: logger.info("Model loaded and ready")

TODO 16: Add @bentoml.api decorator to predict

# FILL IN: @bentoml.api

TODO 17: Log the incoming request

# FILL IN: logger.info(f"[{request_id}] Single prediction request")

TODO 18: Wrap the prediction logic in a try/except block

# FILL IN: try:
#     ...prediction logic...
# except Exception as e:
#     ...error handling...

TODO 19: Run the prediction

# FILL IN: result = self.pipeline(request.text)

TODO 20: Log the successful prediction with latency

# FILL IN: logger.info(f"[{request_id}] Prediction successful", extra={"latency_ms": round(latency, 2)})

TODO 21: Return a SentimentResponse with all fields

# FILL IN: return SentimentResponse(
#     text=request.text,
#     sentiment=result[0]['label'],
#     confidence=round(result[0]['score'], 4),
#     request_id=request_id,
#     timestamp=get_timestamp()
# )

TODO 22: Log the error with stack trace

# FILL IN: logger.error(f"[{request_id}] Prediction failed: {str(e)}", exc_info=True)

TODO 23: Return an error SentimentResponse

# FILL IN: return SentimentResponse(
#     text=request.text, sentiment="ERROR", confidence=0.0,
#     request_id=request_id, timestamp=get_timestamp()
# )
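
Note that TODO 20 logs a latency value, so the prediction logic also needs a timing step that the TODOs leave implicit. Assembled, the method looks roughly like this (a sketch; the starter file has the exact scaffolding):

@bentoml.api
def predict(self, request: SentimentRequest) -> SentimentResponse:
    request_id = generate_request_id(request.request_id)
    logger.info(f"[{request_id}] Single prediction request")
    try:
        start = time.time()
        result = self.pipeline(request.text)
        latency = (time.time() - start) * 1000  # milliseconds
        logger.info(
            f"[{request_id}] Prediction successful",
            extra={"latency_ms": round(latency, 2)},
        )
        return SentimentResponse(
            text=request.text,
            sentiment=result[0]["label"],
            confidence=round(result[0]["score"], 4),
            request_id=request_id,
            timestamp=get_timestamp(),
        )
    except Exception as e:
        logger.error(f"[{request_id}] Prediction failed: {str(e)}", exc_info=True)
        return SentimentResponse(
            text=request.text, sentiment="ERROR", confidence=0.0,
            request_id=request_id, timestamp=get_timestamp(),
        )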

TODO 24: Add @bentoml.api decorator to batch_predict

# FILL IN: @bentoml.api
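
The batch endpoint follows the same pattern: it passes the whole list to the pipeline in one call and records per-batch metrics, roughly like this (a sketch matching the metadata fields shown in Part 3; the starter file scaffolds the details):

@bentoml.api
def batch_predict(self, request: BatchSentimentRequest) -> BatchSentimentResponse:
    request_id = generate_request_id(request.request_id)
    start = time.time()
    # HuggingFace pipelines accept a list of strings, one result dict per input
    results = self.pipeline(request.texts)
    latency = (time.time() - start) * 1000
    return BatchSentimentResponse(
        results=[
            SentimentResponse(
                text=text,
                sentiment=r["label"],
                confidence=round(r["score"], 4),
                request_id=request_id,
                timestamp=get_timestamp(),
            )
            for text, r in zip(request.texts, results)
        ],
        metadata={
            "count": len(request.texts),
            "latency_ms": round(latency, 2),
            "throughput_per_sec": round(len(request.texts) / (latency / 1000), 1),
            "avg_latency_per_text_ms": round(latency / len(request.texts), 2),
        },
        request_id=request_id,
    )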

TODO 25: Add @bentoml.api and implement the health check

# FILL IN: @bentoml.api
# def health(self) -> dict:
#     return {"status": "healthy", "service": "sentiment_analysis", "timestamp": get_timestamp()}

Part 3: Testing & Validation

Testing Examples

Single Prediction

For service_basic, BentoML wraps the plain str parameter under its argument name, text.

# macOS/Linux/WSL — service_basic
curl -X POST http://localhost:3000/predict \
     -H "Content-Type: application/json" \
     -d '{"text": "This workshop is amazing!"}'
# Windows PowerShell — service_basic
$body = '{"text": "This workshop is amazing!"}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body

For service_with_validation, BentoML wraps the Pydantic model under the argument name request.

# macOS/Linux/WSL — service_with_validation
curl -X POST http://localhost:3000/predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"text": "This workshop is amazing!", "request_id": "test-123"}}'
# Windows PowerShell — service_with_validation
$body = '{"request": {"text": "This workshop is amazing!", "request_id": "test-123"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body

Response (service_with_validation):

{
  "text": "This workshop is amazing!",
  "sentiment": "POSITIVE",
  "confidence": 0.9998,
  "request_id": "test-123",
  "timestamp": "2025-01-15T10:30:00.000000"
}

Note: service_basic returns the raw pipeline output instead, e.g. {"label": "POSITIVE", "score": 0.9998}.

Batch Prediction

# macOS/Linux/WSL
curl -X POST http://localhost:3000/batch_predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"texts": ["I loved it!", "Terrible experience.", "Pretty good overall."], "request_id": "batch-456"}}'
# Windows PowerShell
$body = '{"request": {"texts": ["I loved it!", "Terrible experience.", "Pretty good overall."], "request_id": "batch-456"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/batch_predict -ContentType "application/json" -Body $body

Response:

{
  "results": [
    {"text": "I loved it!", "sentiment": "POSITIVE", "confidence": 0.9995, ...},
    {"text": "Terrible experience.", "sentiment": "NEGATIVE", "confidence": 0.9991, ...},
    {"text": "Pretty good overall.", "sentiment": "POSITIVE", "confidence": 0.8876, ...}
  ],
  "metadata": {
    "count": 3,
    "latency_ms": 45.2,
    "throughput_per_sec": 66.4,
    "avg_latency_per_text_ms": 15.07
  },
  "request_id": "batch-456"
}

Error Response

Prediction error:

{
  "text": "test input",
  "sentiment": "ERROR",
  "confidence": 0.0,
  "request_id": "abc123",
  "timestamp": "2025-01-15T10:30:00.000000"
}
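
If you prefer scripted checks over curl, a few lines of Python cover the same cases (assumes the requests package is installed and the validation service is running on port 3000):

import requests

base = "http://localhost:3000"

# Single prediction with request tracking
r = requests.post(f"{base}/predict",
                  json={"request": {"text": "Amazing!", "request_id": "test-123"}})
print(r.status_code, r.json())

# Batch prediction
r = requests.post(f"{base}/batch_predict",
                  json={"request": {"texts": ["Great!", "Terrible", "Okay"]}})
print(r.json()["metadata"])

# Health check
print(requests.get(f"{base}/health").json())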

Part 4: Deployment

Build Bento

Once your service is working, package it as a Bento for deployment:

# Build Bento (creates distributable package)
bentoml build

# List available Bentos
bentoml list
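
bentoml build reads a bentofile.yaml next to your service file. A minimal one for this module might look like the following (values are illustrative; adjust the service path and requirements location to your layout):

service: "service_with_validation:SentimentService"
labels:
  project: mlops-workshop
  module: module-2
include:
  - "*.py"
python:
  requirements_txt: "../requirements.txt"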

Containerization

Convert your Bento to a Docker container:

# Containerize the latest Bento
bentoml containerize sentiment_service:latest -t sentiment-api:v1

# Or specify a specific version
bentoml containerize sentiment_service:abc123 -t sentiment-api:v1.0.0

# List Docker images (macOS/Linux/WSL)
docker images | grep sentiment-api
# List Docker images (Windows PowerShell)
docker images | Select-String sentiment-api

Local Docker Testing

Test your containerized service locally before deploying to Kubernetes:

# Run container
docker run -p 3000:3000 sentiment-api:v1

# Test the containerized service (macOS/Linux/WSL)
curl -X POST http://localhost:3000/predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"text": "Testing containerized service!"}}'

# Test the containerized service (Windows PowerShell)
$body = '{"request": {"text": "Testing containerized service!"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body
# Check health endpoint (macOS/Linux/WSL)
curl http://localhost:3000/health

# Check health endpoint (Windows PowerShell)
Invoke-RestMethod -Uri http://localhost:3000/health

# View container logs
docker logs <container-id>

# Stop container
docker stop <container-id>

Next: In Module 3, you'll deploy this container to Kubernetes!


Key Concepts Covered

BentoML Fundamentals

  • Service Classes: Define services with @bentoml.service decorator
  • Initialization: Load models in __init__() method (runs once at startup)
  • API Endpoints: Create routes with @bentoml.api decorator on methods
  • Type Hints: Use Python type hints for automatic I/O handling
  • Resource Config: Set CPU/memory requirements in decorator

Pydantic v2 Validation

  • Request Models: Type-safe input validation with BaseModel
  • Response Models: Structured output format
  • Field Constraints: Field() with min_length, max_length, ge, le
  • Custom Validators: @field_validator decorator (Pydantic v2)
  • Model Config: model_config dict with json_schema_extra
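
The model_config pattern mentioned above attaches example payloads to the generated schema; in Pydantic v2 it looks like this (a small illustration with an arbitrary example value):

class SentimentRequest(BaseModel):
    text: str

    model_config = {
        "json_schema_extra": {
            "examples": [{"text": "This workshop is amazing!"}]
        }
    }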

Production Patterns

  • Error Handling: Try/except with graceful error encoding
  • Logging: Structured logs with request IDs for tracing
  • Request Tracking: Unique IDs for debugging across services
  • Performance Metrics: Latency and throughput monitoring
  • Health Checks: /health endpoint for load balancers

Batch Processing

  • Batch Endpoints: Process multiple inputs efficiently
  • Metadata: Track performance metrics per batch
  • Throughput: Typically 5-10x speedup vs individual requests
  • Error Handling: Graceful degradation for batch failures

Part 5: Troubleshooting

Troubleshooting

Issue 1: Port 3000 already in use

Symptoms:

Error: Address already in use
OSError: [Errno 48] Address already in use

Solutions:

Option 1: Use different port

bentoml serve service_with_validation:SentimentService --port 3001

Option 2: Kill existing process

# macOS/Linux
lsof -i :3000
kill -9 <PID>
# Windows PowerShell
netstat -ano | findstr :3000
# Note the PID from the last column, then:
taskkill /PID <PID> /F

Option 3: Find and stop BentoML service

# macOS/Linux: Kill all BentoML processes
pkill -f "bentoml serve"

# Or more targeted
ps aux | grep bentoml
kill <PID>
# Windows PowerShell 7+ (Get-Process exposes CommandLine only in PowerShell 7)
Get-Process | Where-Object { $_.CommandLine -like "*bentoml*" } | Stop-Process -Force
# Or use Task Manager to find and end the process

Issue 2: Import errors

Symptoms:

ModuleNotFoundError: No module named 'bentoml'
ImportError: cannot import name 'field_validator' from 'pydantic'

Solutions:

# Step 1: Activate virtual environment
source venv/bin/activate              # macOS / Linux / WSL
# Windows PowerShell: venv\Scripts\Activate.ps1
# Windows CMD:        venv\Scripts\activate.bat

# Step 2: Reinstall dependencies
pip install -r requirements.txt

# Step 3: Verify BentoML version (should be 1.4+)
pip show bentoml
# Version should be >= 1.4.0

# Step 4: Verify Pydantic version (should be v2)
pip show pydantic
# Version should be >= 2.0.0

# Step 5: Check Python version
python --version
# Should be >= 3.9

If issues persist:

# Clean install
pip uninstall bentoml pydantic -y
pip install --no-cache-dir "bentoml>=1.4.0" "pydantic>=2.0.0"

Issue 3: Model not loading or downloading

Symptoms:

HTTPError: 404 Client Error
OSError: Can't load tokenizer for 'distilbert-base-uncased'

Solutions:

Check 1: Model downloads automatically on first run

# Just start the service
bentoml serve service_basic:SentimentService

# Model downloads to cache (may take 1-2 minutes first time)
# Location: ~/.cache/huggingface/hub/

Check 2: Verify cache location

# macOS/Linux/WSL
ls ~/.cache/huggingface/hub/
# Should show model files after first run
# Windows PowerShell
dir $env:USERPROFILE\.cache\huggingface\hub\

Check 3: Manual download (if network issues)

# Pre-download model
python -c "
from transformers import pipeline
model = pipeline('sentiment-analysis', model='distilbert-base-uncased-finetuned-sst-2-english')
print('Model downloaded!')
"

Check 4: Clear cache if corrupted

# macOS/Linux/WSL
rm -rf ~/.cache/huggingface/hub/
# Then restart service to re-download
# Windows PowerShell
Remove-Item -Recurse -Force "$env:USERPROFILE\.cache\huggingface\hub\"
# Then restart service to re-download

Still stuck? Check the solution files


Part 6: Reference

Commands Cheat Sheet

Quick Start

# Navigate to module
cd modules/module-2

# Install dependencies
pip install -r requirements.txt

# Serve basic service
cd starter
bentoml serve service_basic:SentimentService

# Serve with auto-reload (development)
bentoml serve service_with_validation:SentimentService --reload

# Serve on different port
bentoml serve service_with_validation:SentimentService --port 3001

Development Commands

# Serve with live reload (changes auto-reload)
bentoml serve service_with_validation:SentimentService --reload

# Serve with specific host
bentoml serve service_with_validation:SentimentService --host 0.0.0.0

# Serve with development mode (more verbose logging)
bentoml serve service_with_validation:SentimentService --reload --host 0.0.0.0 --port 3000

# View all serve options
bentoml serve --help

API Testing Commands

# macOS/Linux/WSL

# Single prediction — service_basic (str param, wrapped as {"text": "..."})
curl -X POST http://localhost:3000/predict \
     -H "Content-Type: application/json" \
     -d '{"text": "This is amazing!"}'

# Single prediction — service_with_validation (Pydantic model, wrapped as {"request": {...}})
curl -X POST http://localhost:3000/predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"text": "This is amazing!"}}'

# Batch prediction (Pydantic model, wrapped as {"request": {...}})
curl -X POST http://localhost:3000/batch_predict \
     -H "Content-Type: application/json" \
     -d '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'

# Health check
curl http://localhost:3000/health

# Get OpenAPI spec
curl http://localhost:3000/docs.json

# Visit Swagger UI (in browser)
open http://localhost:3000
# Windows PowerShell

# Single prediction — service_basic (str param, wrapped as {"text": "..."})
$body = '{"text": "This is amazing!"}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body

# Single prediction — service_with_validation (Pydantic model, wrapped as {"request": {...}})
$body = '{"request": {"text": "This is amazing!"}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/predict -ContentType "application/json" -Body $body

# Batch prediction (Pydantic model, wrapped as {"request": {...}})
$body = '{"request": {"texts": ["Great!", "Terrible", "Okay"]}}'
Invoke-RestMethod -Method Post -Uri http://localhost:3000/batch_predict -ContentType "application/json" -Body $body

# Health check
Invoke-RestMethod -Uri http://localhost:3000/health

# Get OpenAPI spec
Invoke-RestMethod -Uri http://localhost:3000/docs.json

# Visit Swagger UI (in browser)
start http://localhost:3000

Bento Management

# Build Bento from bentofile.yaml
bentoml build

# List all Bentos
bentoml list

# Get Bento details
bentoml get sentiment_service:latest

# Delete specific Bento
bentoml delete sentiment_service:abc123

# Delete all versions of a Bento
bentoml delete sentiment_service --yes

# Export Bento to file
bentoml export sentiment_service:latest -o sentiment_service.bento

# Import Bento from file
bentoml import sentiment_service.bento

Containerization Commands

# Build Docker image from Bento
bentoml containerize sentiment_service:latest

# Build with custom tag
bentoml containerize sentiment_service:latest -t sentiment-api:v1.0.0

# Build with custom Dockerfile template
bentoml containerize sentiment_service:latest --dockerfile-template ./custom.Dockerfile

# Push to registry
docker tag sentiment_service:latest myregistry.com/sentiment-service:v1
docker push myregistry.com/sentiment-service:v1

Solution Files

If you get stuck, reference implementations are available in solution/:

  • service_basic.py - Exercise 1 completed
  • service_with_validation.py - Exercise 2 completed

Note: Try to complete exercises on your own first! Learning happens when you struggle a bit.

Advanced Challenges (Optional)

After completing all exercises, try these:

  1. Add Caching: Implement response caching for repeated requests
  2. Async Endpoints: Convert to async/await for better concurrency
  3. Metrics Endpoint: Add /metrics endpoint for Prometheus
  4. Custom Models: Replace with a different HuggingFace model
  5. Multiple Endpoints: Add sentiment + topic classification

Next Steps

Once you've completed all exercises and tests pass:

Module 3: Kubernetes Deployment

In Module 3, you'll deploy this BentoML service to Kubernetes!


Having issues? Check the Troubleshooting section or review the solution files!


Navigation

| Previous | Home | Next |
| --- | --- | --- |
| Module 1: Model Training & Experiment Tracking | 🏠 Home | Module 3: Kubernetes Deployment |
