A hands-on, project-based guide to Machine Learning Operations built specifically for DevOps, Platform, and SRE engineers.
No ML background required. Every concept is explained through DevOps analogies you already understand.
If you are completely new to MLOps, read our DevOps to MLOps guide first.
- Who This Is For
- What We Build
- Prerequisites
- Phase 1: Local Dev & Pipelines
- Phase 2: Enterprise Orchestration for ML
- Learning Path
- Tech Stack
- Recommended Reading
- License
Most MLOps resources are written for data scientists learning infrastructure. This repo flips that.
You do not need to become a data scientist. But just like understanding how a Java application is built makes you a better DevOps engineer, understanding how an ML model is built, trained, and served makes you effective at operating ML workloads in production.
| Track | What You Learn |
|---|---|
| 🤖 Traditional ML | Train, serve, automate, and monitor a real ML model on Kubernetes |
| 🧠 Foundational Models | Serve LLMs in production using vLLM, TGI, and Ollama |
| ⚙️ LLM-Powered DevOps | Monitor K8s clusters, build RAG pipelines and agents with LLMs |
Everything runs on Kubernetes, Docker, and tools you already use.
| Skill | Level |
|---|---|
| Kubernetes | Intermediate |
| AWS EKS | Working knowledge |
| Python | Basic (read and run scripts) |
No ML experience needed. That is what this repo teaches.
Goal: Build the ML foundation you need by building an employee attrition prediction model on your local machine.
Use case throughout: Employee attrition prediction for a large organisation (~500,000 employees). Using one problem end to end keeps the focus on infrastructure and operations, not data science theory.
| Step | Title | Guide |
|---|---|---|
| 1 | Project Dataset Pipeline | Read the Guide |
| 2 | Data Preparation Stages | Read the Guide |
| 3 | Training & Building the Prediction Model | Read the Guide |
| 4 | From Model to Live API with KServe | Read the Guide |
Code: phase-1-local-dev/
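As a rough illustration of where step 4 ends up, a model served by KServe can be queried over HTTP using the V1 inference protocol (`POST /v1/models/<name>:predict` with an `instances` array). The sketch below only builds the request and does not send it. The host `attrition.example.com`, the model name `attrition-model`, and the feature values are hypothetical placeholders, not values taken from the guides.

```python
import json
from urllib import request


def build_predict_request(host: str, model_name: str, rows: list) -> request.Request:
    """Build (but do not send) a KServe V1 predict request."""
    # V1 protocol path: /v1/models/<model_name>:predict
    url = f"http://{host}/v1/models/{model_name}:predict"
    # V1 protocol body: one entry in "instances" per row to score
    body = json.dumps({"instances": rows}).encode("utf-8")
    return request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical host, model name, and feature row (assumptions for illustration).
req = build_predict_request("attrition.example.com", "attrition-model", [[34, 5.0, 1, 0]])
print(req.get_method(), req.full_url)
print(req.data.decode("utf-8"))
```

Passing `req` to `urllib.request.urlopen` would send it. In a real cluster the host and any required headers depend on how the ingress is configured.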
Goal: Replace local, manual ML workflows with production-grade orchestration: versioned data, automated pipelines, experiment tracking, and scalable training.
| Step | Title | Guide |
|---|---|---|
| 1 | Data Versioning Fundamentals | Read the Guide |
| 2 | Data Version Control (DVC) with AWS S3 | Read the Guide |
| 3 | Data Versioning using Airflow on Kubernetes | Read the Guide |
| 4 | Feature Store Fundamentals Explained | Read the Guide |
| 5 | Hands-on Feature Store with Feast on Kubernetes | Read the Guide |
| 6 | Kubeflow Explained for MLOps | 🔜 Coming Next |
| 7 | Hands-on Kubeflow on Kubernetes | 🔜 Planned |
| 8 | MLflow Explained for MLOps | 🔜 Planned |
Code: phase-2-enterprise-setup/
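The core idea behind the data versioning steps can be shown without any tooling: a dataset version is identified by a hash of its contents, so any change to the data yields a new identifier, much like a Git commit hash for code. The sketch below illustrates that idea only and is not how DVC is implemented. The function name, the choice of SHA-256, and the sample rows are assumptions made for the example.

```python
import hashlib


def content_version(data: bytes) -> str:
    """Return a version id derived purely from the bytes of a dataset."""
    # SHA-256 is an illustrative choice; real tools may use a different hash.
    return hashlib.sha256(data).hexdigest()


# Hypothetical CSV snapshots of an attrition dataset.
v1 = content_version(b"employee_id,left\n1,0\n2,1\n")
v2 = content_version(b"employee_id,left\n1,0\n2,1\n3,0\n")
same = content_version(b"employee_id,left\n1,0\n2,1\n")

print(v1 == same)  # identical bytes give the same version id
print(v1 != v2)    # one appended row gives a different version id
```

Because the id depends only on content, a small pointer file holding the hash can live in Git while the data itself sits in remote storage such as S3.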
| Phase | Track | Title | Status |
|---|---|---|---|
| 1 | 🤖 Traditional ML | Local Dev & Pipelines | ✅ Done |
| 1 | 🤖 Traditional ML | K8s Deploy & Model Serving | ✅ Done |
| 3 | 🤖 Traditional ML | Enterprise Orchestration | 🔄 In Progress |
| 4 | 🤖 Traditional ML | Monitor & Observe | 🔜 Planned |
| 5 | 🧠 Foundational Models | Foundational Models | 🔜 Planned |
| 6 | 🧠 Foundational Models | LLM Serving & Scaling | 🔜 Planned |
| 7 | ⚙️ LLM-Powered DevOps | LLM-Powered DevOps | 🔜 Planned |
| 8 | ⚙️ LLM-Powered DevOps | Emerging AI Ops | 🔜 Planned |
Here is the tech stack you will be using in this setup.
| Category | Tools |
|---|---|
| Data Pipeline | Python, Airflow |
| Model Training | scikit-learn |
| API / Serving | FastAPI, Flask, Docker, KServe |
| ML Orchestration | Kubeflow, MLflow Pipelines |
| Monitoring | Prometheus, Grafana, Evidently AI |
| Infrastructure | Kubernetes, Helm, GitHub Actions |
- Ray: Open-source distributed computing framework for Python and AI workloads
- rtk: High-performance CLI proxy that reduces LLM token consumption
- CML: CI/CD for Machine Learning Projects
Dual licensed:
- Code (scripts, configs, manifests) — Apache 2.0
- Content (README, guides, docs) — All Rights Reserved
For commercial licensing: contact@devopscube.com