1. Developers vs. Operations (Ops / DevOps / SRE)
| Category | Developers | Operations (Ops / DevOps / SRE) |
|---|---|---|
| Primary Role | Design, write, test, refactor application and service code | Provision, deploy, operate, observe, secure, and scale infrastructure and platforms |
| Key Responsibilities | Implement features; fix defects; refactor and improve code quality; write tests and documentation; optimize performance at the code level | Automate infrastructure provisioning (IaC); deploy releases and manage environments; monitor availability, latency, cost, and security posture; plan capacity and scaling; respond to incidents and analyze root causes |
| Typical Outputs | Source code, unit/integration tests, build artifacts, API definitions | Infrastructure definitions (Terraform, CloudFormation), deployment pipelines, runbooks, dashboards, alerts |
| Scope & Scale | From small scripts to globally distributed microservices | From a single VM stack to multi-region, multi-account cloud platforms |
| Collaboration Pattern | Provide deployable, testable artifacts; supply runtime requirements | Provide reliable platforms & feedback loops; enable self-service deployment and observability |
| Example Activities | Implement a REST endpoint; optimize an algorithm; add a feature flag | Create a blue/green deployment strategy; configure Prometheus alerts; harden IAM policies |
| Example Tools | Git, GitHub/GitLab, IDEs (VS Code, IntelliJ, PyCharm), Docker (dev), Test frameworks, Feature flag systems | AWS / Azure / GCP, Terraform / Pulumi / CloudFormation, Ansible / Puppet / Chef, Kubernetes / EKS, Argo CD / Flux, Prometheus, Grafana, Datadog, Security scanners, Incident tools (PagerDuty) |
Note: Modern high-performing teams blur these boundaries. Developers own more of delivery and runtime quality, while operations engineers build platforms and guardrails rather than performing every deployment by hand.
2. What Is DevOps?
DevOps is a cultural philosophy plus a collection of engineering practices that integrate software development and operations to deliver value rapidly, safely, and sustainably.

Core dimensions:
- Culture & Collaboration (shared goals, reduced friction, psychological safety)
- Automation (build, test, provision, deploy, remediation)
- Continuous Integration & Continuous Delivery (CI/CD)
- Observability & Feedback (metrics, logs, traces, user telemetry)
- Lean Flow (small batch sizes, fast feedback loops)
- Reliability Engineering (SLOs, error budgets, resilience testing)
- Security Shift Left (DevSecOps: integrating security earlier)
- Continuous Learning & Improvement (post-incident reviews, experimentation)
3. Silo Mentality
Silo mentality is a mindset in which teams optimize locally, hoard information, or protect their own processes, which inhibits organizational learning and speed.
| Aspect | Description |
|---|---|
| Causes | Misaligned incentives, legacy org charts, unclear ownership, fear of blame, tool fragmentation |
| Symptoms | Slow handoffs; "throw over the wall" deploys; duplicated scripts; inconsistent environments; delayed incident resolution |
| Impacts | Longer lead times, higher change failure rate, brittle releases, shadow tooling, low morale |
| DevOps Remedies | Cross-functional teams, shared KPIs (e.g., DORA metrics), blameless postmortems, platform APIs and self-service, documentation as code, inner-source repositories |
| Practical Actions | Standardize pipelines, centralize IaC modules, use ChatOps for transparency, use session tagging & access logs for shared visibility, temporarily embed Ops engineers in feature teams |
4. Why DevOps?
Pre-DevOps, siloed approaches often exhibited:
- Long, serialized phases (code → wait weeks → test → wait → deploy)
- Manual, error-prone deployments & environment drift
- Fragile big-bang releases (high blast radius)
- Limited, delayed feedback (production errors discovered by users)
- Security & compliance bolted on late
- Low deployment frequency + high change failure rate
DevOps counters these with:
- Small, frequent, automated changes
- Immutable infrastructure & versioned environments
- Continuous testing & security scanning
- Observable systems with rapid detection & rollback
- Shared accountability (you build it, you help run it)
- Shorter mean time to recovery (MTTR)
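
Deployment frequency, change failure rate, and MTTR from the lists above can be measured directly from deployment records. The following is a minimal sketch, assuming a hypothetical `Deployment` record with commit, deploy, and restore timestamps; the field names and the `dora_summary` helper are illustrative and not taken from any specific tool.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Optional


@dataclass
class Deployment:
    """Hypothetical record of one production deployment."""
    committed_at: datetime                  # when the change was committed
    deployed_at: datetime                   # when it reached production
    failed: bool = False                    # did it cause a production failure?
    restored_at: Optional[datetime] = None  # when service was restored, if it failed


def hours(delta: timedelta) -> float:
    return delta.total_seconds() / 3600


def dora_summary(deploys: list[Deployment], window_days: int) -> dict[str, float]:
    """Summarize four delivery metrics over a window of `window_days` days."""
    if not deploys:
        raise ValueError("need at least one deployment")
    lead_times = [hours(d.deployed_at - d.committed_at) for d in deploys]
    failures = [d for d in deploys if d.failed]
    recoveries = [hours(d.restored_at - d.deployed_at)
                  for d in failures if d.restored_at is not None]
    return {
        "deploys_per_day": len(deploys) / window_days,
        "median_lead_time_hours": median(lead_times),
        "change_failure_rate": len(failures) / len(deploys),
        "mean_time_to_recovery_hours":
            sum(recoveries) / len(recoveries) if recoveries else 0.0,
    }


if __name__ == "__main__":
    t0 = datetime(2024, 1, 1, 9, 0)
    sample = [
        Deployment(t0, t0 + timedelta(hours=2)),
        Deployment(t0, t0 + timedelta(hours=4), failed=True,
                   restored_at=t0 + timedelta(hours=5)),
    ]
    print(dora_summary(sample, window_days=1))
```

Tracking these numbers per team over time, rather than as a one-off snapshot, is what makes them useful as shared KPIs.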
5. When to Adopt DevOps
Appropriate When (almost ubiquitous today):
- Rapid iteration is needed (SaaS, mobile backends, data platforms)
- Microservices, event-driven, or modular architectures
- Multi-environment consistency (dev/test/stage/prod)
- Infrastructure elasticity (cloud, containers, serverless)
- Need to reduce lead time & increase deployment frequency
Special Considerations (NOT a blanket "do not use"):
- Highly regulated, safety-critical domains (finance, healthcare, energy) still benefit, but require:
  - Strong change governance as code (policy as code, approvals embedded in pipelines)
  - Segregation of duties implemented via role-based and attribute-based access controls (not manual silos)
  - Immutable audit trails (CloudTrail, pipeline logs)
  - Automated evidence collection for compliance

Anti-pattern to avoid: using "mission critical" as justification for retaining manual, brittle processes. Automation with proper controls improves both reliability and auditability.
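
To make "approvals embedded in pipelines" and segregation of duties concrete, here is a small sketch of a policy check a pipeline could run before promoting a change. The `ChangeRequest` shape, the `evaluate_change` function, and the rule that production needs an approver other than the author are all assumptions for illustration; real setups would typically express this in a policy engine such as OPA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChangeRequest:
    """Hypothetical description of a change awaiting promotion."""
    author: str
    approvers: frozenset[str]
    target_env: str
    tests_passed: bool


def evaluate_change(change: ChangeRequest, min_approvals: int = 1) -> list[str]:
    """Return policy violations; an empty list means the change may proceed."""
    violations = []
    if not change.tests_passed:
        violations.append("tests must pass before promotion")
    if change.target_env == "prod":
        # Segregation of duties: the author's own approval does not count.
        independent = change.approvers - {change.author}
        if len(independent) < min_approvals:
            violations.append(
                "prod changes need approval from someone other than the author")
    return violations


if __name__ == "__main__":
    self_approved = ChangeRequest("alice", frozenset({"alice"}), "prod", True)
    print(evaluate_change(self_approved))
```

Because the rule is code, every evaluation can be logged alongside the pipeline run, which doubles as automated compliance evidence.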
6. DevOps Lifecycle (Conceptual Flow)
- (Plan) → Code → Build → Test → (Secure) → Package → Release → Deploy → Operate → Monitor → (Learn / Improve)
- Continuous Development
- Continuous Integration
- Continuous Testing
- Continuous Deployment/Delivery
- Continuous Monitoring & Feedback (+ Continuous Security & Governance embedded in each stage)
7. Stage 1: Continuous Development
- Activities: Planning, requirements sharing, iterative coding, branching strategies (trunk-based or short-lived feature branches).
- Tools: Git, GitHub/GitLab, Issue trackers (Jira, GitHub Issues), Architecture-as-code diagrams (Mermaid, PlantUML).
- Practices:
  - Small commits tied to user stories.
  - Feature flags for progressive delivery.
  - Code review automation (lint, static analysis).
Example:

    # Trunk-based model with a short-lived feature branch
    git checkout -b feat/payment-retry
    # Implement and test, then stage the changes
    git add .
    git commit -m "feat: add idempotent payment retry logic"
    git push -u origin feat/payment-retry
    # Open PR -> CI runs tests & static analysis -> merge quickly
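
The feature-flag practice listed above can be illustrated with a percentage rollout. This is a minimal sketch, assuming users are identified by a string ID and that the flag name `payment-retry` is hypothetical; production flag systems add targeting rules, kill switches, and remote configuration on top of this idea.

```python
import hashlib


def bucket(flag: str, user_id: str) -> int:
    """Map a (flag, user) pair to a stable bucket in the range 0..99."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode("utf-8")).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Enable the flag for roughly `rollout_percent` percent of users.

    Hashing keeps each user's decision stable across requests, and a user
    enabled at 10% stays enabled when the rollout widens to 50%.
    """
    return bucket(flag, user_id) < rollout_percent


if __name__ == "__main__":
    users = [f"user-{i}" for i in range(1000)]
    enabled = sum(is_enabled("payment-retry", u, 10) for u in users)
    print(f"{enabled} of {len(users)} users see the new code path")
```

Including the flag name in the hash means different flags select different user subsets, so the same users are not always the first to receive every new feature.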
8. Stage 2: Continuous Integration (CI)
- Purpose: Ensure that new code integrates cleanly with the mainline several times per day.
- Triggers: Every push or pull request.
- Automated Steps: Compile/build, unit tests, static code analysis, security scanning (SAST), dependency checks (SCA).
- Tools: GitHub Actions, Jenkins, GitLab CI, CircleCI, Azure Pipelines.
- Pipeline Example: push event -> build (compile) -> unit tests -> lint -> security scan -> package artifact (container / zip) -> publish to artifact registry
- AWS-Relevant Example:
  - Build container images in CodeBuild or GitHub Actions → push to Amazon ECR → sign with Notation / Cosign.
  - Use OIDC so that CI can assume AWS roles without static credentials.
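
The pipeline example above is a sequence of gates in which a failure stops everything after it. A rough sketch of that fail-fast behavior follows; `run_pipeline` and the lambda stages are placeholders, not how any particular CI system is implemented.

```python
from typing import Callable

# A stage is a name plus a callable that returns True on success.
Stage = tuple[str, Callable[[], bool]]


def run_pipeline(stages: list[Stage]) -> tuple[bool, list[str]]:
    """Run stages in order and stop at the first failure.

    Returns (succeeded, names of the stages that actually ran).
    """
    executed = []
    for name, step in stages:
        executed.append(name)
        if not step():
            return False, executed
    return True, executed


if __name__ == "__main__":
    stages = [
        ("build", lambda: True),
        ("unit tests", lambda: True),
        ("lint", lambda: False),          # simulated failure
        ("security scan", lambda: True),
        ("package artifact", lambda: True),
    ]
    ok, ran = run_pipeline(stages)
    print("succeeded" if ok else f"failed at '{ran[-1]}'", ran)
```

Ordering cheap, fast checks first keeps feedback quick: a lint failure is reported before any time is spent on scanning or packaging.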
9. Stage 3: Continuous Testing
- Focus: Automated quality gates (unit, integration, contract, performance, security, accessibility).
- Tools: Selenium / Playwright (UI), TestNG / JUnit (Java), PyTest, k6 / JMeter (performance), OPA / Checkov (policy), Trivy / Grype (image scanning).
- Best Practices:
  - Shift-left performance & security tests (run earlier on critical paths).
  - Contract tests to prevent breaking dependent services.
  - Ephemeral test environments (provision on demand via IaC, tear down post-run).
- Example:

      # Performance test using k6 (simplified)
      k6 run load-test.js
      # k6 exits non-zero when a threshold defined in load-test.js fails
      # (for example, 95th percentile latency above the limit), which fails the pipeline
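
The 95th percentile gate mentioned in the example can be reasoned about independently of k6. The sketch below uses the nearest-rank definition of a percentile, which is one of several common definitions and an assumption here; the 250 ms limit is likewise only an illustrative threshold.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def latency_gate(samples_ms: list[float], p95_limit_ms: float) -> bool:
    """Return True if the run passes, i.e. p95 latency is within the limit."""
    return percentile(samples_ms, 95) <= p95_limit_ms


if __name__ == "__main__":
    # 18 fast requests and 2 slow ones: the slow tail exceeds 5% of traffic.
    samples = [100.0] * 18 + [900.0] * 2
    passed = latency_gate(samples, p95_limit_ms=250)
    print("gate passed" if passed else "gate failed")
    raise SystemExit(0 if passed else 1)  # non-zero exit fails the CI job
```

A percentile gate tolerates a few outliers, unlike a maximum, while still catching a slow tail that an average would hide.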
10. Stage 4: Continuous Deployment/Delivery
- Continuous Delivery: Code is always in a releasable state; promotions may require an approval gate.
- Continuous Deployment: Every passing change auto-deploys to production (subject to guardrails).
- Techniques: Blue/Green, Rolling, Canary, Feature flag rollout, Shadow traffic.
- IaC & Config: Terraform, CloudFormation, CDK, and Pulumi define environments as code, which helps ensure environment parity.
- GitOps: Desired state (manifests) stored in Git; controllers (Argo CD, Flux) reconcile cluster state.
| Category | Tools |
|---|---|
| Deployment Orchestrators | Argo CD, Flux, Spinnaker, Harness, Octopus |
| Containers & Scheduling | Docker, Kubernetes (EKS), ECS |
| Packaging | Helm, Kustomize, OCI Artifacts |
| Config Management | Ansible, Puppet, Chef, Salt |
| IaC | Terraform, Pulumi, CloudFormation, CDK |
| Release Strategies | Flagger (canary), Argo Rollouts |
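
The GitOps idea of controllers reconciling cluster state toward the desired state in Git comes down to a diff followed by corrective actions. Here is a simplified sketch, assuming both states are plain maps from workload name to image reference; actual controllers such as Argo CD and Flux compare full Kubernetes manifests, and `plan_reconcile` is a hypothetical name.

```python
def plan_reconcile(desired: dict[str, str], actual: dict[str, str]) -> list[tuple[str, str]]:
    """Compare desired state (from Git) with actual state (from the cluster).

    Both maps are workload name -> image reference.
    Returns (action, workload name) pairs in name order.
    """
    actions = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            actions.append(("create", name))
        elif name not in desired:
            # Pruning resources that are absent from Git is typically configurable.
            actions.append(("delete", name))
        elif desired[name] != actual[name]:
            actions.append(("update", name))
    return actions


if __name__ == "__main__":
    desired = {"api": "myapp:v1.4.2", "worker": "worker:v2"}
    actual = {"api": "myapp:v1.4.1", "cron": "cron:v1"}
    for action, name in plan_reconcile(desired, actual):
        print(action, name)
```

Running the same plan repeatedly until it returns an empty list is what makes the approach self-healing: manual drift in the cluster shows up as a new diff on the next pass.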
Example (GitOps Flow):

    Developer merges -> CI builds & pushes image: myapp:v1.4.2
    CI updates the deployment manifest image tag in Git (pull request)
    Argo CD detects the change -> syncs to cluster -> progressive rollout (canary 10% -> 50% -> 100%, e.g., via Argo Rollouts or Flagger)
    Metrics and error-rate guards trigger a rollback if thresholds are exceeded
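
The canary progression and rollback guard in the flow above can be modeled as a loop over traffic steps. This is a toy sketch under stated assumptions: `error_rate_at` is a stand-in for querying a metrics backend, and the 2% limit is an example value, not a recommendation.

```python
from typing import Callable


def progressive_rollout(
    steps: list[int],
    error_rate_at: Callable[[int], float],
    max_error_rate: float,
) -> tuple[str, int]:
    """Advance through traffic percentages and roll back on the first unhealthy step.

    Returns ("promoted", final step) or ("rolled_back", failing step).
    """
    if not steps:
        raise ValueError("at least one rollout step is required")
    for percent in steps:
        # In a real system this would wait for an analysis interval first.
        if error_rate_at(percent) > max_error_rate:
            return "rolled_back", percent
    return "promoted", steps[-1]


if __name__ == "__main__":
    # Simulated metrics: the new version degrades once it takes half the traffic.
    observed = {10: 0.004, 50: 0.035, 100: 0.050}
    result = progressive_rollout([10, 50, 100], observed.__getitem__, max_error_rate=0.02)
    print(result)
```

Stopping at the first unhealthy step bounds the blast radius: in the simulated run, the remaining half of the traffic never reaches the faulty version.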
11. Stage 5: Continuous Monitoring & Feedback
- Observability Pillars: Metrics, Logs, Traces, Events, Profiling, User experience data (RUM).
- Goals: Detect anomalies early, measure SLO compliance, feed improvement loops.
- Tools:
  - Metrics & Alerts: Prometheus, CloudWatch, Datadog, Dynatrace
  - Logs: ELK / OpenSearch, CloudWatch Logs, Splunk
  - Tracing: OpenTelemetry, Jaeger, AWS X-Ray
  - Visualization: Grafana, Kibana
  - Synthetic & RUM: k6, Checkly, Pingdom, New Relic Browser
- Practices:
  - Define Service Level Indicators (SLIs) & SLOs (e.g., availability, latency p95).
  - Use error budgets to govern the balance between release velocity and reliability work.
  - Correlate deploy events with performance changes (annotate dashboards).
  - Centralize structured logs (JSON) and propagate trace IDs.
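
An error budget follows arithmetically from an SLO: if the target is 99.9% successful requests, then 0.1% of requests may fail before the budget is spent. The sketch below works through that calculation for a request-based availability SLO; the function name and the returned fields are assumptions for illustration.

```python
def error_budget(slo_target: float, total_requests: int, failed_requests: int) -> dict[str, float]:
    """Compute the error budget for a request-based availability SLO.

    slo_target is a fraction such as 0.999 (99.9% of requests succeed).
    """
    if not 0 < slo_target < 1:
        raise ValueError("slo_target must be between 0 and 1 (exclusive)")
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    allowed_failures = (1 - slo_target) * total_requests
    return {
        "allowed_failures": allowed_failures,
        "remaining_failures": allowed_failures - failed_requests,
        "budget_consumed": failed_requests / allowed_failures,
    }


if __name__ == "__main__":
    budget = error_budget(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
    print(f"budget consumed: {budget['budget_consumed']:.0%}")
    if budget["remaining_failures"] <= 0:
        print("budget exhausted: prioritize reliability work over new releases")
```

With one million requests and a 99.9% target, roughly 1,000 failures are allowed, so 250 failures consume about a quarter of the budget.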
Example Alert Policy (Conceptual):

    IF (http_request_error_rate_5m > 2%) AND (deployment_in_progress == true)
    THEN trigger canary halt & page on-call
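
The conceptual policy above translates almost line for line into code. The following sketch assumes error and request counts for the last five minutes are already available; the action names `halt_canary` and `page_on_call` are hypothetical labels, and real alerting would normally be expressed in the monitoring system's own rule language.

```python
def error_rate(errors: int, requests: int) -> float:
    """Fraction of failed requests; zero traffic counts as zero errors."""
    return errors / requests if requests else 0.0


def evaluate_alert(
    errors_5m: int,
    requests_5m: int,
    deployment_in_progress: bool,
    threshold: float = 0.02,
) -> list[str]:
    """Return the actions to take for the last five-minute window."""
    if deployment_in_progress and error_rate(errors_5m, requests_5m) > threshold:
        return ["halt_canary", "page_on_call"]
    return []


if __name__ == "__main__":
    print(evaluate_alert(errors_5m=30, requests_5m=1000, deployment_in_progress=True))
    print(evaluate_alert(errors_5m=30, requests_5m=1000, deployment_in_progress=False))
```

Tying the alert to the deployment state is what lets the response be automatic: an error spike during a rollout most likely points at the change being rolled out.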
12. Continuous Security/DevSecOps
| Phase | Security Practices |
|---|---|
| Plan / Code | Threat modeling, secrets scanning, dependency governance |
| Build | Static analysis (SAST), license compliance, artifact signing |
| Test | Dynamic analysis (DAST), fuzzing, IaC security scanning (Checkov, tfsec) |
| Deploy | Policy as code (OPA/Gatekeeper), least privilege IAM roles, supply chain attestations (SLSA) |
| Operate | Runtime security (Falco, Amazon GuardDuty), anomaly detection, automated key rotation |
| Monitor | Central SIEM correlation, continuous compliance (Cloud Custodian, AWS Config) |
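
As an illustration of the secrets scanning listed for the Plan / Code phase, the sketch below checks text for two well-known secret shapes. The pattern set is deliberately tiny and is an assumption for demonstration; dedicated scanners ship far larger rule sets plus entropy checks.

```python
import re

# Long-term AWS access key IDs start with "AKIA" followed by 16 uppercase
# letters or digits. PEM private keys start with a recognizable header line.
PATTERNS = {
    "aws_access_key_id": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private_key_block": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def scan_text(text: str) -> list[tuple[int, str]]:
    """Return (line number, pattern name) for every suspected secret."""
    findings = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((lineno, name))
    return findings


if __name__ == "__main__":
    # AKIAIOSFODNN7EXAMPLE is the placeholder key ID used in AWS documentation.
    sample = 'region = "us-east-1"\naccess_key = "AKIAIOSFODNN7EXAMPLE"\n'
    for lineno, name in scan_text(sample):
        print(f"line {lineno}: possible {name}")
```

Running a check like this as a pre-commit hook or an early CI step catches a leaked credential before it reaches the shared history, where removal is much harder.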
13. Glossary
| Term | Definition |
|---|---|
| CI (Continuous Integration) | Merging changes into the mainline frequently, with an automated build and test run for each change |
| CD (Continuous Delivery/Deployment) | Keeping software always deployable / auto-deploying each change |
| GitOps | Managing infra & app state declaratively in Git with automated reconciliation |
| IaC | Infrastructure definitions in code enabling versioning & automation |
| SLI / SLO | A measured indicator of service performance and the target set for it (e.g., latency p95 < 250ms) |
| Error Budget | The amount of unreliability an SLO permits; once it is spent, the release pace slows in favor of reliability work |
| Canary Release | Deploy to a small subset of users/traffic to validate health |
| Blue/Green | Two production environments; switch traffic with minimal downtime |
| Observability | Ability to infer internal state from external outputs (metrics/logs/traces) |
| Shift-Left Security | Moving security feedback earlier in lifecycle stages |
14. Reference Example (Minimal GitHub Actions -> AWS Deployment Skeleton)

    name: build-and-deploy
    on:
      push:
        branches: [ main ]
    env:
      AWS_REGION: us-east-1        # assumption: replace with your region
      ACCOUNT_ID: "123456789012"   # assumption: placeholder AWS account ID
    jobs:
      build:
        runs-on: ubuntu-latest
        permissions:
          id-token: write   # For OIDC federation
          contents: read
        steps:
          - uses: actions/checkout@v4
          - name: Set up Node
            uses: actions/setup-node@v4
            with:
              node-version: 20
          - run: npm ci && npm test
          - name: Configure AWS credentials via OIDC
            uses: aws-actions/configure-aws-credentials@v4
            with:
              # assumption: "ci-deploy" is a hypothetical role that trusts GitHub's OIDC provider
              role-to-assume: arn:aws:iam::${{ env.ACCOUNT_ID }}:role/ci-deploy
              aws-region: ${{ env.AWS_REGION }}
          - name: Build image
            run: docker build -t app:${{ github.sha }} .
          - name: Login to ECR
            run: |
              aws ecr get-login-password --region "$AWS_REGION" \
                | docker login --username AWS --password-stdin "$ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com"
          - name: Push image
            run: |
              docker tag app:${{ github.sha }} "$ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/app:${{ github.sha }}"
              docker push "$ACCOUNT_ID.dkr.ecr.$AWS_REGION.amazonaws.com/app:${{ github.sha }}"
          - name: Update manifest (example)
            run: |
              sed -i "s|IMAGE_TAG|${{ github.sha }}|" k8s/deployment.yaml
          - name: Commit manifest (optional PR)
            run: |
              # In GitOps workflows, push changes to the manifest repo rather than applying directly
              echo "Create PR to Argo-monitored repo"
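
One caveat with the `sed` step in the skeleton: if the `IMAGE_TAG` placeholder is missing from the manifest, `sed` still exits successfully and nothing changes. A stricter variant could look like the sketch below, where `set_image_tag` is a hypothetical helper and the manifest is handled as plain text rather than parsed YAML.

```python
def set_image_tag(manifest: str, tag: str, placeholder: str = "IMAGE_TAG") -> str:
    """Replace every occurrence of the placeholder and fail loudly if none exists."""
    if placeholder not in manifest:
        raise ValueError(f"placeholder {placeholder!r} not found in manifest")
    return manifest.replace(placeholder, tag)


if __name__ == "__main__":
    manifest = (
        "containers:\n"
        "  - name: app\n"
        "    image: example.invalid/app:IMAGE_TAG\n"  # hypothetical registry host
    )
    print(set_image_tag(manifest, "3f2a9c1"))
```

Failing the job when the placeholder is absent avoids a release that reports success while the cluster keeps running the previous image.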