Production ML monitoring: drift detection (PSI/KL divergence), latency SLOs, and automated alerting with rollback triggers.
flowchart TB
subgraph Detectors["Drift Detection"]
PSI[PSI Detector]
KL[KL Divergence]
PRED[Prediction Drift]
end
subgraph Metrics["Metrics Collection"]
LAT[Latency Tracker]
PROM[Prometheus Exporter]
end
subgraph Alerts["Alerting"]
TRIGGER[Rollback Triggers]
RULES[Alert Rules]
end
API[FastAPI] --> Detectors
API --> Metrics
Metrics --> PROM
PROM --> TRIGGER
TRIGGER -->|threshold exceeded| RULES
RULES -->|critical| ROLLBACK[Auto Rollback]
| Algorithm | Purpose | Threshold |
|---|---|---|
| PSI | Feature distribution shift | < 0.1 (none), 0.1-0.2 (moderate), 0.2-0.25 (significant), > 0.25 (critical) |
| KL Divergence | Distribution distance | < 0.1 (none), > 0.1 (drift) |
| JS Divergence | Symmetric, bounded in [0, 1] with a base-2 logarithm | Complementary metric |
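
The three measures in the table can be sketched as follows. This is a minimal illustration, not the code in `psi_detector.py` or `kl_detector.py`: the function names, the default of 10 bins, and the `eps` floor for empty bins are all assumptions, and it uses only the standard library to stay self-contained even though the project itself lists NumPy and SciPy.

```python
import bisect
import math
import statistics


def quantile_edges(reference, n_bins=10):
    """Interior bin edges placed at the reference sample's quantiles."""
    # statistics.quantiles returns n_bins - 1 cut points.
    return statistics.quantiles(reference, n=n_bins, method="inclusive")


def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values per bin, floored at eps (assumed) to avoid log(0)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect.bisect_right(edges, v)] += 1
    fractions = [max(c / len(values), eps) for c in counts]
    norm = sum(fractions)
    return [f / norm for f in fractions]


def psi(p, q):
    """Population Stability Index between reference p and current q."""
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


def kl_divergence(p, q):
    """KL(p || q) in nats; asymmetric and unbounded."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


def js_divergence(p, q):
    """Jensen-Shannon divergence in bits; symmetric and within [0, 1]."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return (kl_divergence(p, m) + kl_divergence(q, m)) / (2 * math.log(2))


# Synthetic data: a uniform reference and a copy shifted by 300 units.
reference = [float(i) for i in range(1000)]
shifted = [x + 300.0 for x in reference]

edges = quantile_edges(reference)
p = bin_fractions(reference, edges)
q = bin_fractions(shifted, edges)

print("PSI:", psi(p, q))
print("KL:", kl_divergence(p, q))
print("JS:", js_divergence(p, q))
```

Binning on the reference quantiles gives each reference bin roughly equal mass, so a shift in the current sample shows up as mass moving between bins rather than being hidden in one wide bin.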
- p50: Median latency
- p95: 95th percentile (SLO: < 100ms)
- p99: 99th percentile (SLO: < 500ms)
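
As a rough sketch of how these percentiles and SLO checks might be computed from raw samples (the names `latency_percentiles`, `slo_violations`, and `SLO_MS` are hypothetical and not taken from `latency.py`; the thresholds are the ones listed above):

```python
import statistics

# SLO thresholds in milliseconds, from the list above.
SLO_MS = {"p95": 100.0, "p99": 500.0}


def latency_percentiles(samples_ms):
    """Return p50/p95/p99 for a list of latency samples in milliseconds."""
    # quantiles(n=100) yields 99 cut points; index k - 1 is the k-th percentile.
    cuts = statistics.quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]}


def slo_violations(percentiles, slo_ms=SLO_MS):
    """Names of percentiles that are not strictly below their SLO limit."""
    return [name for name, limit in slo_ms.items() if percentiles[name] >= limit]


# Synthetic example: 90 fast requests and 10 slow ones.
samples = [10.0] * 90 + [600.0] * 10
stats = latency_percentiles(samples)
print(stats, slo_violations(stats))
```

A production tracker would more likely keep a sliding window or a histogram than a full list of samples; the list is used here only to keep the example short.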
triggers = [
{"metric": "psi_score", "threshold": 0.25, "action": "alert"},
{"metric": "psi_score", "threshold": 0.5, "action": "rollback"},
{"metric": "latency_p99", "threshold": 0.5, "action": "scale_down"},
{"metric": "error_rate", "threshold": 0.05, "action": "rollback"},
]

# Install dependencies
make install
# Run tests
make test
# Start service
make serve
# Or with Docker
make up

| Endpoint | Method | Description |
|---|---|---|
| /monitor | POST | Submit prediction data |
| /drift | GET | Get drift status |
| /latency | GET | Get latency statistics |
| /alerts | GET | Get alert status |
| /metrics | GET | Prometheus metrics |
| /health | GET | Health check |
curl -X POST http://localhost:8000/monitor \
-H "Content-Type: application/json" \
-d '{
"features": {
"transaction_amount": 150.0,
"transaction_hour": 14,
"days_since_last_transaction": 1.5,
"transaction_count_24h": 3,
"distance_from_home": 5.0
},
"prediction": "legitimate",
"probability": 0.88,
"latency_ms": 12.5
}'

curl http://localhost:8000/drift

Response:
{
"timestamp": "2024-01-15T10:30:00",
"features_monitored": 5,
"drift_detected": true,
"psi_scores": {
"transaction_amount": 0.08,
"distance_from_home": 0.32
},
"drifted_features": ["distance_from_home"]
}

model-observability/
├── src/
│   ├── detectors/
│   │   ├── psi_detector.py      # PSI calculation
│   │   ├── kl_detector.py       # KL/JS divergence
│   │   └── prediction_drift.py  # Output drift
│   ├── metrics/
│   │   ├── latency.py           # p50/p95/p99 tracking
│   │   └── prometheus.py        # Metrics export
│   ├── alerts/
│   │   └── rollback_trigger.py  # Auto-rollback
│   └── api/
│       └── main.py              # FastAPI app
├── dashboards/
│   └── drift_overview.json      # Grafana dashboard
├── alerts/
│   └── prometheus_rules.yml     # Alert definitions
├── tests/
├── docker-compose.yml
└── Makefile
Pre-configured alerts:
- ModerateDataDrift: PSI 0.1-0.2 for 10m
- SignificantDataDrift: PSI 0.2-0.25 for 5m
- CriticalDataDrift: PSI > 0.25 for 2m (triggers rollback)
- HighP95Latency: p95 > 100ms for 5m
- CriticalP99Latency: p99 > 500ms for 2m (triggers rollback)
- "I implemented PSI with quantile-based binning for accurate distribution comparison"
- "The system tracks p50/p95/p99 latencies with configurable SLO thresholds"
- "Rollback triggers use a cooldown mechanism to prevent alert storms"
- "JS divergence provides a symmetric metric bounded in [0, 1] (with a base-2 logarithm) for easier interpretation"
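
The cooldown idea mentioned above could look roughly like this. It is a sketch under stated assumptions, not the contents of `rollback_trigger.py`: the `Trigger` and `TriggerEvaluator` names, the 300-second default cooldown, and the per-(metric, action) cooldown key are all hypothetical. The trigger values mirror two entries from the `triggers` list earlier in this README.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Trigger:
    metric: str
    threshold: float
    action: str


@dataclass
class TriggerEvaluator:
    triggers: list
    cooldown_s: float = 300.0  # assumed default, not from the project
    _last_fired: dict = field(default_factory=dict)

    def evaluate(self, metrics, now=None):
        """Return actions whose threshold is exceeded and not in cooldown."""
        now = time.monotonic() if now is None else now
        actions = []
        for t in self.triggers:
            value = metrics.get(t.metric)
            if value is None or value <= t.threshold:
                continue
            key = (t.metric, t.action)
            last = self._last_fired.get(key)
            if last is not None and now - last < self.cooldown_s:
                continue  # fired recently; suppress to avoid an alert storm
            self._last_fired[key] = now
            actions.append(t.action)
        return actions


evaluator = TriggerEvaluator(
    triggers=[
        Trigger("psi_score", 0.25, "alert"),
        Trigger("psi_score", 0.5, "rollback"),
    ]
)

# Explicit timestamps make the cooldown behaviour easy to follow.
print(evaluator.evaluate({"psi_score": 0.3}, now=0.0))   # first breach fires
print(evaluator.evaluate({"psi_score": 0.3}, now=10.0))  # suppressed by cooldown
```

Keying the cooldown on the (metric, action) pair lets a rollback still fire while the lower-severity alert for the same metric is cooling down.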
- Python 3.11+ / uv
- FastAPI + Pydantic
- NumPy + SciPy
- Prometheus + Grafana
- Docker
MIT