YZXBiz/model-observability


Model Observability


Production ML monitoring: drift detection (PSI/KL divergence), latency SLOs, and automated alerting with rollback triggers.

Architecture

flowchart TB
    subgraph Detectors["Drift Detection"]
        PSI[PSI Detector]
        KL[KL Divergence]
        PRED[Prediction Drift]
    end

    subgraph Metrics["Metrics Collection"]
        LAT[Latency Tracker]
        PROM[Prometheus Exporter]
    end

    subgraph Alerts["Alerting"]
        TRIGGER[Rollback Triggers]
        RULES[Alert Rules]
    end

    API[FastAPI] --> Detectors
    API --> Metrics
    Metrics --> PROM
    PROM --> TRIGGER
    TRIGGER -->|threshold exceeded| RULES
    RULES -->|critical| ROLLBACK[Auto Rollback]

Features

Drift Detection

| Algorithm | Purpose | Threshold |
|---|---|---|
| PSI | Feature distribution shift | < 0.1 (none), 0.1-0.2 (moderate), 0.2-0.25 (significant), > 0.25 (critical) |
| KL Divergence | Distribution distance (asymmetric) | < 0.1 (none), > 0.1 (drift) |
| JS Divergence | Symmetric, bounded in [0, 1] (with base-2 logarithms) | Complementary metric |
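
As a rough illustration of the PSI row above, the following is a minimal sketch of PSI with quantile-based binning, written with the standard library only. The function names (`quantile_edges`, `bin_fractions`, `psi`), the bin count, and the `eps` floor are assumptions made for this example; the code in `src/detectors/psi_detector.py` is not shown in this README and may differ.

```python
import math
from bisect import bisect_right


def quantile_edges(reference, n_bins=10):
    """Interior bin edges at equally spaced quantiles of the reference sample."""
    ordered = sorted(reference)
    edges = []
    for i in range(1, n_bins):
        # Simple index-based quantile; enough for a sketch.
        idx = min(len(ordered) - 1, int(i * len(ordered) / n_bins))
        edges.append(ordered[idx])
    return sorted(set(edges))  # tied values can produce duplicate edges


def bin_fractions(values, edges, eps=1e-6):
    """Fraction of values in each bin, floored at eps to avoid log(0)."""
    counts = [0] * (len(edges) + 1)
    for v in values:
        counts[bisect_right(edges, v)] += 1
    total = len(values)
    return [max(c / total, eps) for c in counts]


def psi(reference, current, n_bins=10):
    """PSI = sum over bins of (cur - ref) * ln(cur / ref)."""
    edges = quantile_edges(reference, n_bins)
    ref = bin_fractions(reference, edges)
    cur = bin_fractions(current, edges)
    return sum((c - r) * math.log(c / r) for r, c in zip(ref, cur))
```

Binning on the reference quantiles gives every reference bin roughly equal mass, so a shift in the current data shows up as mass moving between bins rather than being hidden inside one wide bin.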

Latency SLOs

  • p50: Median latency
  • p95: 95th percentile (SLO: < 100ms)
  • p99: 99th percentile (SLO: < 500ms)
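
The percentiles and SLO checks above could be computed along these lines. This is a sketch under stated assumptions: it uses the nearest-rank percentile definition, the names `SLO_MS`, `percentile`, and `latency_report` are hypothetical, and the tracker in `src/metrics/latency.py` may use a different method (for example interpolation or a sliding window).

```python
# SLO limits in milliseconds, taken from the list above (p50 has no SLO).
SLO_MS = {"p95": 100.0, "p99": 500.0}


def percentile(sorted_values, q):
    """Nearest-rank percentile for an integer q in 1..100."""
    n = len(sorted_values)
    rank = max(1, (q * n + 99) // 100)  # integer ceiling of q * n / 100
    return sorted_values[rank - 1]


def latency_report(latencies_ms):
    """Return (percentile stats, names of SLOs that are violated)."""
    ordered = sorted(latencies_ms)
    stats = {f"p{q}": percentile(ordered, q) for q in (50, 95, 99)}
    # The SLO is "strictly below the limit", so reaching it counts as a violation.
    violations = [name for name, limit in SLO_MS.items() if stats[name] >= limit]
    return stats, violations
```

For example, `latency_report([10] * 90 + [600] * 10)` reports a healthy median but flags both `p95` and `p99`, which is why tail percentiles are tracked separately from p50.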

Automated Rollback Triggers

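triggers = [
    # PSI above the critical band raises an alert; a much larger shift rolls back.
    {"metric": "psi_score", "threshold": 0.25, "action": "alert"},
    {"metric": "psi_score", "threshold": 0.5, "action": "rollback"},
    # Presumably seconds: 0.5 matches the 500 ms p99 SLO above.
    {"metric": "latency_p99", "threshold": 0.5, "action": "scale_down"},
    # Error rate as a fraction of requests (0.05 = 5%).
    {"metric": "error_rate", "threshold": 0.05, "action": "rollback"},
]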
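
One way such a trigger list might be evaluated, including a cooldown so that a metric hovering above its threshold does not fire repeatedly, is sketched below. The `TriggerEvaluator` class, its `cooldown_s` default, and the injectable `clock` are assumptions for illustration; they are not taken from `src/alerts/rollback_trigger.py`.

```python
import time

TRIGGERS = [
    {"metric": "psi_score", "threshold": 0.25, "action": "alert"},
    {"metric": "psi_score", "threshold": 0.5, "action": "rollback"},
    {"metric": "latency_p99", "threshold": 0.5, "action": "scale_down"},
    {"metric": "error_rate", "threshold": 0.05, "action": "rollback"},
]


class TriggerEvaluator:
    """Fires actions for exceeded thresholds, at most once per cooldown window."""

    def __init__(self, triggers, cooldown_s=300.0, clock=time.monotonic):
        self.triggers = triggers
        self.cooldown_s = cooldown_s  # hypothetical default, not from the repo
        self.clock = clock            # injectable so tests can control time
        self._last_fired = {}         # (metric, action) -> last fire time

    def evaluate(self, metrics):
        """Return the actions to take for the given metric snapshot."""
        now = self.clock()
        fired = []
        for trigger in self.triggers:
            value = metrics.get(trigger["metric"])
            if value is None or value <= trigger["threshold"]:
                continue  # metric missing or threshold not exceeded
            key = (trigger["metric"], trigger["action"])
            last = self._last_fired.get(key)
            if last is not None and now - last < self.cooldown_s:
                continue  # still cooling down for this trigger
            self._last_fired[key] = now
            fired.append(trigger["action"])
        return fired
```

Keying the cooldown on the (metric, action) pair lets an escalation from `alert` to `rollback` on the same metric fire immediately, while repeats of the same action are suppressed.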

Quick Start

# Install dependencies
make install

# Run tests
make test

# Start service
make serve

# Or with Docker
make up

API Endpoints

| Endpoint | Method | Description |
|---|---|---|
| /monitor | POST | Submit prediction data |
| /drift | GET | Get drift status |
| /latency | GET | Get latency statistics |
| /alerts | GET | Get alert status |
| /metrics | GET | Prometheus metrics |
| /health | GET | Health check |

Submit Prediction Data

curl -X POST http://localhost:8000/monitor \
  -H "Content-Type: application/json" \
  -d '{
    "features": {
      "transaction_amount": 150.0,
      "transaction_hour": 14,
      "days_since_last_transaction": 1.5,
      "transaction_count_24h": 3,
      "distance_from_home": 5.0
    },
    "prediction": "legitimate",
    "probability": 0.88,
    "latency_ms": 12.5
  }'
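
The same request can be sent from Python. The sketch below uses only `urllib` from the standard library and mirrors the payload fields of the curl example; the helper names `build_monitor_request` and `send` are hypothetical, and the shape of the response body is not documented here, so it is returned as parsed JSON without further assumptions.

```python
import json
import urllib.request


def build_monitor_request(features, prediction, probability, latency_ms,
                          base_url="http://localhost:8000"):
    """Build (but do not send) a POST /monitor request."""
    payload = {
        "features": features,
        "prediction": prediction,
        "probability": probability,
        "latency_ms": latency_ms,
    }
    return urllib.request.Request(
        f"{base_url}/monitor",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout_s=5.0):
    """Send the request and return the decoded JSON body."""
    with urllib.request.urlopen(request, timeout=timeout_s) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the service running locally, `send(build_monitor_request({"transaction_amount": 150.0}, "legitimate", 0.88, 12.5))` submits one observation.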

Check Drift Status

curl http://localhost:8000/drift

Response:

{
  "timestamp": "2024-01-15T10:30:00",
  "features_monitored": 5,
  "drift_detected": true,
  "psi_scores": {
    "transaction_amount": 0.08,
    "distance_from_home": 0.32
  },
  "drifted_features": ["distance_from_home"]
}
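
A client might map the `psi_scores` in this response onto severity bands. The sketch below uses the PSI bands from the Prometheus Alerts section (0.1-0.2 moderate, 0.2-0.25 significant, above 0.25 critical); how the exact boundary values are handled is an assumption, and `psi_severity` and `summarize_drift` are hypothetical names.

```python
import json


def psi_severity(score):
    """Map a PSI score to a severity band (boundary handling is an assumption)."""
    if score < 0.1:
        return "none"
    if score < 0.2:
        return "moderate"
    if score <= 0.25:
        return "significant"
    return "critical"


def summarize_drift(response_text):
    """Return {feature: severity} for a /drift response body."""
    body = json.loads(response_text)
    return {name: psi_severity(score) for name, score in body["psi_scores"].items()}
```

Applied to the response above, `transaction_amount` (0.08) maps to `none` and `distance_from_home` (0.32) maps to `critical`, matching the `drifted_features` list.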

Project Structure

model-observability/
├── src/
│   ├── detectors/
│   │   ├── psi_detector.py      # PSI calculation
│   │   ├── kl_detector.py       # KL/JS divergence
│   │   └── prediction_drift.py  # Output drift
│   ├── metrics/
│   │   ├── latency.py           # p50/p95/p99 tracking
│   │   └── prometheus.py        # Metrics export
│   ├── alerts/
│   │   └── rollback_trigger.py  # Auto-rollback
│   └── api/
│       └── main.py              # FastAPI app
├── dashboards/
│   └── drift_overview.json      # Grafana dashboard
├── alerts/
│   └── prometheus_rules.yml     # Alert definitions
├── tests/
├── docker-compose.yml
└── Makefile

Prometheus Alerts

Pre-configured alerts:

  • ModerateDataDrift: PSI 0.1-0.2 for 10m
  • SignificantDataDrift: PSI 0.2-0.25 for 5m
  • CriticalDataDrift: PSI > 0.25 for 2m (triggers rollback)
  • HighP95Latency: p95 > 100ms for 5m
  • CriticalP99Latency: p99 > 500ms for 2m (triggers rollback)

Interview Talking Points

  • "I implemented PSI with quantile-based binning for accurate distribution comparison"
  • "The system tracks p50/p95/p99 latencies with configurable SLO thresholds"
  • "Rollback triggers use a cooldown mechanism to prevent alert storms"
  • "JS divergence provides a symmetric, bounded [0,1] metric for easier interpretation"
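
The symmetry and boundedness of JS divergence mentioned in the last point can be checked numerically. This is a minimal sketch over discrete distributions (lists of bin probabilities), using base-2 logarithms so that JS divergence falls in [0, 1]; with natural logarithms the upper bound would be ln 2 instead. The function names and the `eps` guard are assumptions and are not taken from `src/detectors/kl_detector.py`.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(
        pi * math.log2(pi / max(qi, eps))  # eps guards against q_i == 0
        for pi, qi in zip(p, q)
        if pi > 0
    )


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

KL divergence changes when its arguments are swapped, while JS divergence does not, and two distributions with no overlap (for example `[1, 0]` and `[0, 1]`) reach the upper bound of 1.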

Tech Stack

  • Python 3.11+ / uv
  • FastAPI + Pydantic
  • NumPy + SciPy
  • Prometheus + Grafana
  • Docker

License

MIT
