SREPowers

SRE infrastructure skills for Claude Code: Test-Driven Operations and Subagent-Driven Operations for Kubernetes, Keycloak, GitOps, API workflows, and more.

Overview

SREPowers adapts proven software development workflows (TDD, subagent-driven development) for infrastructure operations. These skills help you execute infrastructure changes systematically with verification-first discipline.

Skill Workflow Diagram

graph TD
    Start([Need to perform<br/>infrastructure operation]) --> Decision{Have a plan?}
    Decision -->|No| Brainstorm[brainstorming-operations]
    Decision -->|Yes, detailed| WritePlan[writing-operation-plans]
    Decision -->|Yes, ready to execute| ExecMode{Execution mode?}

    Brainstorm --> WritePlan
    WritePlan --> ExecMode

    ExecMode -->|Same session,<br/>continuous| Subagent[subagent-driven-operation]
    ExecMode -->|Separate session,<br/>checkpoints| Execute[executing-operation-plans]

    Subagent --> TDO[test-driven-operation]
    Execute --> TDO

    TDO --> More{More tasks?}
    More -->|Yes| TDO
    More -->|No| Verify[verification-before-completion]
    Verify --> Finish[finishing-operation-branch]

    Finish --> End([Complete])

    style Start fill:#e1f5e1
    style End fill:#e1f5e1
    style TDO fill:#fff4e1
    style Subagent fill:#e1f0ff
    style Execute fill:#e1f0ff
    style Verify fill:#ffe1e1

SRE Principles

All skills in SREPowers are bound by five core principles:

| # | Principle | Description |
|---|-----------|-------------|
| 1 | Safety First | All operational commands MUST include dry-run validation before execution |
| 2 | Structured Output | Use tables, bullet points, and explicit phases (Pre-check → Execute → Verify) |
| 3 | Evidence-Driven | Always reference specific log lines, metrics, or config parameters |
| 4 | Audit-Ready | Every recommendation must be traceable and reversible |
| 5 | Communication | Technical accuracy with business clarity |

Security

SREPowers enforces a safety-first security posture across all infrastructure operations:

| Capability | How Enforced | Primary Skills |
|------------|--------------|----------------|
| Dry-run validation | All operational commands require dry-run before execution (Principle #1) | safety-validator |
| Risk classification | 4-tier system (Critical/High/Medium/Low) with typed confirmation for destructive ops | safety-validator |
| Least privilege | Non-root containers, minimal RBAC, scoped service accounts | kubernetes-specialist, container-engineer, platform-engineer |
| Secret management | No hardcoded secrets, scanning patterns, external secret references | security-reviewer, terraform-engineer |
| Secure coding | OWASP Top 10 prevention, input validation, authentication patterns | secure-code-guardian |
| Infrastructure security | DevSecOps pipelines, compliance automation, cloud security audits | security-reviewer |
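
As a rough illustration of how the dry-run requirement and typed confirmation could combine, here is a minimal Python sketch. The keyword rules, the `classify` and `may_execute` names, and the choice to require typed confirmation for the High and Critical tiers are all assumptions made for this example; they are not taken from the safety-validator skill itself.

```python
from enum import IntEnum


class Risk(IntEnum):
    LOW = 1
    MEDIUM = 2
    HIGH = 3
    CRITICAL = 4


# Hypothetical keyword rules, for illustration only; checked from highest risk down.
RULES = [
    (Risk.CRITICAL, ("terraform destroy", "delete namespace", "drop database")),
    (Risk.HIGH, ("kubectl delete", "terraform apply", "rm -rf")),
    (Risk.MEDIUM, ("kubectl apply", "kubectl scale", "helm upgrade")),
]


def classify(command: str) -> Risk:
    """Return the first (highest) tier whose keywords appear in the command."""
    text = command.lower()
    for risk, needles in RULES:
        if any(needle in text for needle in needles):
            return risk
    return Risk.LOW


def may_execute(command: str, dry_run_passed: bool, typed_confirmation: str = "") -> bool:
    """Allow execution only after a dry-run and, for risky commands, a typed confirmation."""
    if not dry_run_passed:
        return False
    # Assumption: High and Critical commands require the operator to retype the command.
    if classify(command) >= Risk.HIGH:
        return typed_confirmation == command
    return True
```

Retyping the exact command is one possible form of typed confirmation; a real validator might instead ask for the resource name or a fixed phrase.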

Key security skills:

  • /safety-validator -- Review proposed commands before execution; 4-tier risk classification with typed confirmation for destructive operations
  • /security-reviewer -- Security audits, SAST/dependency/secret scanning, penetration testing, infrastructure security reviews
  • /secure-code-guardian -- Application security, OWASP Top 10 prevention, authentication/authorization, encryption

Every operation skill integrates safety checks. The test-driven-operation Iron Law ("no infrastructure change without a failing verification first") ensures changes are validated before they reach production.

Installation

Via Claude Code Marketplace (Recommended)

# Add the marketplace
/plugin marketplace add yg-codes/srepowers

# Install the plugin
/plugin install srepowers@srepowers-marketplace

# Verify installation
/help
# You should see:
# /test-driven-operation - Use when executing infrastructure operations...
# /subagent-driven-operation - Use when executing infrastructure operation plans...

Manual Installation

Clone this repository into your local Claude Code plugins directory, or copy the skills into your skills directory:

# Clone the repository
git clone https://github.com/yg-codes/srepowers.git ~/.claude/plugins/srepowers

# Or copy skills directly
cp -r srepowers/skills/* ~/.claude/skills/

Skill Selection Guide

| Situation | Recommended Skill | Alternative |
|-----------|-------------------|-------------|
| Planning phase | | |
| Need to design an infrastructure operation | brainstorming-operations | - |
| Have a design, need detailed steps | writing-operation-plans | - |
| Execution phase | | |
| Ready to execute, want continuous flow | subagent-driven-operation | - |
| Long operation, need checkpoints | executing-operation-plans | - |
| Single operation with verification | test-driven-operation | - |
| About to claim work is done/deployed/healthy | verification-before-completion | - |
| Kubernetes | | |
| Deploy workloads, configure cluster | kubernetes-specialist | - |
| Build container images | container-engineer | - |
| Progressive deployment | progressive-delivery | - |
| Infrastructure as Code | | |
| Write Terraform modules | terraform-engineer | - |
| Orchestrate with Terragrunt | terragrunt-expert | - |
| Databases | | |
| PostgreSQL operations | postgresql-engineer | - |
| Incident Response | | |
| Production incident | incident-commander | systematic-troubleshooting |
| Write post-mortem | post-mortem-writer | - |
| Cost & Optimization | | |
| Analyze cloud costs | cost-optimizer | - |
| Reduce operational toil | toil-analysis | - |
| Observability | | |
| Set up monitoring | observability-engineer | - |
| Verify with metrics | observability-integration | - |

Choosing Your Workflow

Not every operation needs the full brainstorm-plan-execute-verify spine. SREPowers adapts automatically:

Execution Patterns

The subagent-driven-operation skill selects a pattern based on plan characteristics:

| Pattern | When | Behavior |
|---------|------|----------|
| Inline | <= 2 tasks AND risk is not high | Execute in main context, no subagent spawn, self-review |
| Segmented | 3-6 tasks, no decision checkpoints | Batch into segments of 2-3, subagent per segment |
| Full Subagent | 7+ tasks OR high risk OR any task lacks rollback | Fresh subagent per task with two-stage review (spec + quality) |
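
The selection rules in the table can be sketched as a small Python function. This is an illustration only: the function name and parameters are hypothetical, the Full Subagent conditions are assumed to take precedence when rules overlap, and the fallback for 3-6 tasks with decision checkpoints is an assumption because the table does not cover that case.

```python
def select_pattern(
    task_count: int,
    high_risk: bool,
    all_tasks_have_rollback: bool = True,
    has_decision_checkpoints: bool = False,
) -> str:
    """Return 'inline', 'segmented', or 'full' following the table above."""
    # Full Subagent: 7+ tasks OR high risk OR any task lacks rollback.
    if task_count >= 7 or high_risk or not all_tasks_have_rollback:
        return "full"
    # Inline: <= 2 tasks AND risk is not high.
    if task_count <= 2:
        return "inline"
    # Segmented: 3-6 tasks with no decision checkpoints.
    if not has_decision_checkpoints:
        return "segmented"
    # Assumption: 3-6 tasks with checkpoints are not covered by the table,
    # so this sketch falls back to the most conservative pattern.
    return "full"
```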

TDO Exceptions

The test-driven-operation Iron Law has three defined exceptions, each of which requires human partner consent:

| Exception | When It Applies | Example |
|-----------|-----------------|---------|
| Emergency response | Time-critical incident | Production outage, active security incident |
| Read-only diagnostics | Only querying state | kubectl get, terraform plan, log analysis |
| Dry-run exploration | First pass only, no changes | terraform plan, kubectl diff, kubectl apply --dry-run=server |

Workflow Tier Selection

| Situation | Recommended Path |
|-----------|------------------|
| Simple query or read-only check | Use domain skill directly (e.g., /kubernetes-specialist) |
| Single change with clear expected outcome | /test-driven-operation (inline) |
| 2-6 independent tasks, medium risk | /subagent-driven-operation (inline or segmented) |
| 7+ tasks or high risk | /subagent-driven-operation (full) or /executing-operation-plans |
| Unsure what to do | /brainstorming-operations first, then choose above |

Available Skills

test-driven-operation

Use when: Executing infrastructure operations with verification commands - API calls, kubectl, Keycloak CRDs, Git MRs, Linux server operations.

Core principle: If you didn't watch the verification fail, you don't know if it verifies the right thing.

Workflow:

  1. RED - Write failing verification command (kubectl, API call, etc.)
  2. Verify RED - Run it and watch it fail
  3. GREEN - Execute minimal infrastructure operation
  4. Verify GREEN - Run verification and confirm it passes
  5. REFACTOR - Document and clean up

Example:

# RED - Verification fails
kubectl get pod -n production -l app=api-server
# No resources found in production namespace.

# GREEN - Apply minimal manifest
kubectl apply -f api-server-pod.yaml

# Verify GREEN - Passes
kubectl get pod -n production -l app=api-server
# NAME          READY   STATUS    RESTARTS   AGE
# api-server    1/1     Running   0          5s
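
The same RED/GREEN cycle can be expressed as a small harness. The sketch below is hypothetical (`run_tdo_cycle` and `TDOError` are names invented for this example) and uses an in-memory dictionary in place of a real cluster so that it runs anywhere.

```python
from typing import Callable


class TDOError(RuntimeError):
    """Raised when the RED or GREEN expectation of a cycle is violated."""


def run_tdo_cycle(verify: Callable[[], bool], operate: Callable[[], None]) -> None:
    # RED: the verification must fail before anything is changed.
    if verify():
        raise TDOError("verification already passes; it cannot prove the change")
    # GREEN: execute the minimal operation.
    operate()
    # Verify GREEN: the same verification must now pass.
    if not verify():
        raise TDOError("verification still fails after the operation")


# Stand-in for cluster state; a real verify() would shell out to kubectl or an API.
cluster: dict[str, str] = {}
run_tdo_cycle(
    verify=lambda: cluster.get("api-server") == "Running",
    operate=lambda: cluster.update({"api-server": "Running"}),
)
```

Refusing to proceed when the verification already passes is the point of the RED step: a check that never fails says nothing about whether the operation had any effect.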

subagent-driven-operation

Use when: Executing infrastructure operation plans with independent tasks in the current session.

Core principle: Fresh subagent per task + two-stage review (spec compliance then artifact quality) = high quality, fast iteration.

Adaptive execution patterns (selected based on plan complexity):

| Pattern | When | Token Savings |
|---------|------|---------------|
| Inline | <= 2 tasks, low risk | ~14K per task |
| Segmented | 3-6 tasks | 30-50% vs full |
| Full Subagent | 7+ tasks or high risk | Baseline |

Workflow:

  1. Read plan, parse YAML frontmatter, check for resume state
  2. Select execution pattern (inline/segmented/full)
  3. For each task (or segment):
    • Dispatch operator subagent with full task text
    • Execute operations following Test-Driven Operation
    • Handle deviations (R1-R4 taxonomy with retry limits)
    • Spec compliance review - Verify all requirements met
    • Artifact quality review - Verify YAML/JSON valid, proper labels/annotations
    • Update execution state in plan file
  4. After all tasks: Final artifact review

Two-Stage Review:

  • Spec Compliance: Did we execute exactly what was requested?
  • Artifact Quality: Are the infrastructure artifacts well-built?
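
To make the artifact quality stage concrete, here is a minimal sketch of a check on a JSON manifest. The function name is hypothetical, and treating these two recommended Kubernetes labels as mandatory is an assumed policy for the example, not something the skill prescribes.

```python
import json

# Assumed policy for this sketch: these two recommended labels are required.
REQUIRED_LABELS = ("app.kubernetes.io/name", "app.kubernetes.io/managed-by")


def artifact_quality_issues(manifest_text: str) -> list[str]:
    """Return a list of problems; an empty list means the manifest passed."""
    try:
        manifest = json.loads(manifest_text)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(manifest, dict):
        return ["manifest must be a JSON object"]

    issues = [
        f"missing field: {field}"
        for field in ("apiVersion", "kind", "metadata")
        if field not in manifest
    ]
    metadata = manifest.get("metadata")
    labels = metadata.get("labels") if isinstance(metadata, dict) else None
    if not isinstance(labels, dict):
        labels = {}
    issues += [f"missing label: {label}" for label in REQUIRED_LABELS if label not in labels]
    return issues
```

The sketch parses JSON only because that needs nothing beyond the standard library; a YAML manifest would need a YAML parser first.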

brainstorming-operations

Use when: Planning infrastructure operations before implementation.

Core principle: Design operations with risk assessment, verification strategies, and rollback plans before executing.

Workflow:

  1. Understand current infrastructure state
  2. Ask questions to refine operation scope
  3. Present design in sections with validation
  4. Document current state, desired state, approach
  5. Include risk assessment and rollback strategies

Output: Design document saved to docs/plans/YYYY-MM-DD-<operation-name>-design.md

writing-operation-plans

Use when: You have a design and need to create bite-sized execution steps.

Core principle: Create detailed plans with exact commands, verification steps, and rollback instructions.

Workflow:

  1. Write plan with TDO discipline for each task
  2. Include exact commands (no placeholders)
  3. Document verification commands with expected outputs
  4. Provide rollback steps for each task
  5. Save to docs/plans/YYYY-MM-DD-<operation-name>.md

Output: Execution plan that operators can follow step-by-step.

Plan format: YAML frontmatter with risk level, environment, status tracking, and requirements traceability (works with ClickUp, Jira, Linear, or any issue tracker).

Quality gate: Automated plan-checker subagent validates 6 dimensions (rollback coverage, verification concreteness, environment boundaries, dry-run presence, side-effect checks, risk consistency) before execution handoff.
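
A plan check along these lines might look like the following sketch, which covers only three of the six dimensions (rollback coverage, verification concreteness, dry-run presence). The task field names (`name`, `rollback`, `verification`, `dry_run`) are assumptions for illustration and may not match the real plan schema.

```python
def check_plan(tasks: list[dict]) -> list[str]:
    """Return human-readable findings; an empty list means the checks passed."""
    findings = []
    for task in tasks:
        name = task.get("name", "<unnamed>")
        # Rollback coverage: every task needs a rollback step.
        if not task.get("rollback"):
            findings.append(f"{name}: missing rollback")
        # Verification concreteness: a command plus the output it should produce.
        verification = task.get("verification") or {}
        if not (verification.get("command") and verification.get("expected")):
            findings.append(f"{name}: verification lacks command or expected output")
        # Dry-run presence.
        if not task.get("dry_run"):
            findings.append(f"{name}: missing dry-run")
    return findings
```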

gitlab-ecr-pipeline

Use when: Creating GitLab CI/CD pipelines that push container images to AWS ECR.

Core principle: Generate complete pipelines with proper authentication, building, and pushing.

Supports: Building from Containerfile/Dockerfile, mirroring upstream images

Features: AWS ECR authentication, Podman/buildah support, multi-stage builds, tagging strategies

puppet-code-analyzer

Use when: Analyzing Puppet code quality in control repos or modules.

Core principle: Automated analysis with linting, dependency checking, best practice validation.

Features: Syntax validation, dependency analysis, style guide compliance, error troubleshooting

Workflow:

  1. Identify Puppet control repo or module
  2. Run syntax validation with puppet-lint
  3. Analyze dependencies and module structure
  4. Check style guide compliance
  5. Generate analysis report with recommendations

pve-admin

Use when: Managing Proxmox VE 8.x/9.x and Proxmox Backup Server 3.x infrastructure.

Core principle: Complete Proxmox administration with cluster management and safe operations.

Features: Cluster management, VM/CT operations, ZFS storage, networking, HA, backup/restore, health checks

Operations:

  • VM/CT lifecycle (create, start, stop, migrate)
  • Storage management (ZFS, LVM, directory, NFS)
  • Network configuration (bridges, bonds, VLANs)
  • Cluster operations (join, leave, quorum)
  • Backup/restore (PBS integration)
  • Health monitoring and diagnostics

sre-runbook

Use when: Creating structured SRE runbooks for infrastructure operations.

Core principle: Runbooks with Command/Expected/Result format for verifiable procedures.

Output: Structured runbooks with pre-requisites, step-by-step procedures, verification, rollback

Format:

  • Pre-requisites (access, tools, state)
  • Procedures with Command/Expected/Result format
  • Verification steps
  • Rollback procedures
  • Troubleshooting section

executing-operation-plans

Use when: You have a written infrastructure operation plan to execute in a separate session with review checkpoints - for long-running operations requiring human review between steps.

Core principle: Batch execution with checkpoints for safety verification and human review.

Workflow:

  • Load and review plan, parse YAML frontmatter, check for resume state
  • Pre-execution safety check
  • Execute batch (3 tasks or per-environment) with TDO discipline
  • Handle deviations (R1-R4 taxonomy)
  • Batch verification, update execution state in plan file
  • Report and checkpoint
  • Continue or complete

Resume support: Plans track execution state (pending/in_progress/completed) with per-task status, enabling resume after interruption.
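
The resume logic can be pictured with a short sketch. It assumes per-task status is held as an ordered mapping of task name to status, and that an interrupted `in_progress` task is picked up again first; how the plan file actually stores this state is not shown here.

```python
from typing import Optional

STATUSES = ("pending", "in_progress", "completed")


def next_task(task_status: dict[str, str]) -> Optional[str]:
    """Return the first task that is not completed, or None when the plan is done."""
    # Dicts preserve insertion order, so tasks are visited in plan order.
    for name, status in task_status.items():
        if status not in STATUSES:
            raise ValueError(f"unknown status for {name}: {status}")
        if status != "completed":
            return name
    return None


def plan_status(task_status: dict[str, str]) -> str:
    """Derive the overall plan status from the per-task statuses."""
    values = set(task_status.values())
    if not values or values == {"pending"}:
        return "pending"
    if values == {"completed"}:
        return "completed"
    return "in_progress"
```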

verification-before-completion

Use when: About to claim infrastructure work is complete, deployed, fixed, or healthy — before any commit, PR, or status update.

Core principle: Evidence before claims, always. No completion claims without fresh verification command output.

Gate function: Identify verification command → Run it → Read full output → Verify → Only then claim.

Requirements traceability: Cross-references plan's acceptance criteria against task execution evidence. All requirements must be done or explicitly skipped before completion.

Common SRE failures prevented:

  • Helm exit 0 ≠ deployment succeeded (run kubectl rollout status)
  • Pod Running ≠ service healthy (check health endpoint)
  • kubectl apply exit 0 ≠ config applied (read back the value)
  • Agent reports success ≠ verified (check VCS diff)
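
One way to picture the gate is as a record of evidence that must be both fresh and matching before a claim is allowed. The sketch below is hypothetical: the `Evidence` type, the function names, and the five-minute freshness window are assumptions, and the `run` callable stands in for whatever executes the verification command.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Callable


@dataclass(frozen=True)
class Evidence:
    command: str
    output: str
    captured_at: datetime


def gather_evidence(command: str, run: Callable[[str], str]) -> Evidence:
    """Run the verification command and record its full output with a timestamp."""
    return Evidence(command, run(command), datetime.now(timezone.utc))


def may_claim(evidence: Evidence, expected: str,
              max_age: timedelta = timedelta(minutes=5)) -> bool:
    """Allow a completion claim only for fresh evidence that matches the expectation."""
    # Assumption: evidence older than max_age no longer counts as fresh.
    fresh = datetime.now(timezone.utc) - evidence.captured_at <= max_age
    return fresh and evidence.output.strip() == expected
```

Comparing the read-back value against the expected one, rather than trusting an exit code, addresses the "exit 0 ≠ applied" failures listed above.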

observability-integration

Use when: Verifying infrastructure operations using metrics and alerting data from Prometheus, Grafana, or other observability platforms.

Core principle: Metrics don't lie - use observability data to verify operations and detect issues early.

Features:

  • Pre/post operation metric comparison
  • Baseline establishment
  • Alert validation
  • Prometheus query examples
  • Integration with TDO cycles
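
A pre/post comparison can be as simple as the sketch below. It assumes every metric is of the "higher is worse" kind (error rate, latency) and uses a hypothetical 10% tolerance; the metric names in any real use would come from your own Prometheus or Grafana queries.

```python
def regressions(
    baseline: dict[str, float],
    after: dict[str, float],
    tolerance: float = 0.10,
) -> dict[str, float]:
    """Return the absolute increase for each metric that rose beyond the tolerance."""
    result = {}
    for name, before in baseline.items():
        # A metric missing from `after` raises KeyError instead of passing silently.
        current = after[name]
        allowed = before * (1.0 + tolerance)
        if current > allowed:
            result[name] = current - before
    return result
```

For example, a baseline p99 latency of 0.20 s that rises to 0.30 s after the change exceeds the 10% tolerance and is reported, while an unchanged error rate is not.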

incident-commander

Use when: Coordinating response to major infrastructure incidents requiring structured incident command.

Core principle: Clear command structure + effective communication + systematic troubleshooting = faster incident resolution.

Features:

  • ICS-style role assignment (IC, Operations, Communications, Scribe)
  • Severity levels and escalation triggers
  • Communication templates
  • Timeline tracking
  • Multi-phase response process

post-mortem-writer

Use when: Creating blameless post-mortems after infrastructure incidents.

Core principle: Blameless post-mortems create a culture of learning and continuous improvement.

Features:

  • Structured post-mortem template
  • Timeline reconstruction
  • Root cause analysis framework
  • Action item tracking
  • Blameless writing guidelines

progressive-delivery

Use when: Releasing changes with staged traffic shifting, SLO-based rollback triggers, or blue-green cutover.

Core principle: Each traffic stage is a TDO cycle — verify SLOs before promoting to the next stage.

Features:

  • Canary release workflow (1% → 5% → 25% → 50% → 100%)
  • Blue-green cutover with immediate rollback capability
  • Shadow traffic validation (zero user impact testing)
  • SLO-based rollback triggers at each stage
  • Per-stage verification commands
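
The staged promotion can be sketched as a loop over the canary percentages with an SLO gate at each step. The callables are placeholders: `set_traffic` and `slo_ok` are hypothetical hooks that a real rollout would implement against its traffic router and metrics backend, and rolling back to 0% is an assumed rollback action.

```python
from typing import Callable

# Stage percentages from the canary workflow listed above.
STAGES = (1, 5, 25, 50, 100)


def run_canary(set_traffic: Callable[[int], None], slo_ok: Callable[[int], bool]) -> bool:
    """Promote stage by stage; roll back and stop at the first SLO violation."""
    for percent in STAGES:
        set_traffic(percent)
        if not slo_ok(percent):
            set_traffic(0)  # assumed rollback: send no traffic to the new version
            return False
    return True
```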

toil-analysis

Use when: Quantifying operational toil, planning automation investments, or justifying headcount decisions.

Core principle: Toil > 50% of engineering capacity means freeze feature work and automate.

Features:

  • Toil inventory with time tracking (task × frequency × duration)
  • Capacity planning projection model (5-quarter growth forecast)
  • Automation prioritization matrix (Impact × Ease × Risk scoring)
  • Reduction progress tracking with before/after measurement
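
The arithmetic behind the inventory and the 50% rule can be sketched as follows. The class and field names are hypothetical, the 1-5 scoring scale is an assumption, and `risk` is assumed to be scored so that a higher value means safer to automate (otherwise multiplying it into the priority would reward risky work).

```python
from dataclasses import dataclass


@dataclass
class ToilTask:
    name: str
    runs_per_week: float
    minutes_per_run: float
    impact: int  # assumed scale 1-5
    ease: int    # assumed scale 1-5
    risk: int    # assumed scale 1-5, higher = safer to automate

    @property
    def hours_per_week(self) -> float:
        # frequency x duration, converted to hours
        return self.runs_per_week * self.minutes_per_run / 60

    @property
    def priority(self) -> int:
        # Impact x Ease x Risk scoring
        return self.impact * self.ease * self.risk


def toil_fraction(tasks: list[ToilTask], team_hours_per_week: float) -> float:
    return sum(task.hours_per_week for task in tasks) / team_hours_per_week


def should_freeze_features(tasks: list[ToilTask], team_hours_per_week: float) -> bool:
    """Apply the core principle: toil above 50% of capacity means automate first."""
    return toil_fraction(tasks, team_hours_per_week) > 0.5
```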

architecture-designer

Use when: Designing new system architecture, reviewing existing designs, or making architectural decisions.

Focus: Design patterns, ADRs, scalability planning, system design review.

chaos-engineer

Use when: Designing chaos experiments, implementing failure injection frameworks, or conducting game day exercises.

Focus: Blast radius control, game days, antifragile systems, resilience testing.

cloud-architect

Use when: Designing cloud architectures, planning migrations, or optimizing multi-cloud deployments.

Focus: Well-Architected Framework, cost optimization, disaster recovery, landing zones, serverless.

code-documenter

Use when: Adding docstrings, creating API documentation, or building documentation sites.

Focus: OpenAPI/Swagger specs, JSDoc, doc portals, tutorials, user guides.

code-reviewer

Use when: Reviewing pull requests, conducting code quality audits, or identifying security vulnerabilities.

Focus: PR reviews, code quality checks, refactoring suggestions.

devops-engineer

Use when: Setting up CI/CD pipelines, containerizing applications, or managing infrastructure as code.

Focus: Pipelines, Docker, Kubernetes, cloud platforms, GitOps.

golang-pro

Use when: Building Go applications requiring concurrent programming, microservices architecture, or high-performance systems.

Focus: Goroutines, channels, Go generics, gRPC integration.

kubernetes-specialist

Use when: Deploying or managing Kubernetes workloads requiring cluster configuration, security hardening, or troubleshooting.

Focus: Helm charts, RBAC, NetworkPolicies, storage, performance optimization.

microservices-architect

Use when: Designing distributed systems, decomposing monoliths, or implementing microservices patterns.

Focus: Service boundaries, DDD, saga patterns, event sourcing, service mesh, distributed tracing.

observability-engineer

Use when: Setting up observability systems including monitoring, logging, metrics, tracing, or alerting.

Focus: Dashboards, Prometheus/Grafana, OpenTelemetry, load testing, profiling, capacity planning, SLO-based alerting.

postgresql-engineer

Use when: Optimizing PostgreSQL queries, configuring replication, or implementing advanced database features.

Focus: EXPLAIN analysis, JSONB operations, extension usage, VACUUM tuning, performance monitoring, complex SQL patterns, query migration.

python-pro

Use when: Building Python 3.11+ applications requiring type safety, async programming, or production-grade patterns.

Focus: Type hints, pytest, async/await, dataclasses, mypy configuration.

rust-engineer

Use when: Building Rust applications requiring memory safety, systems programming, or zero-cost abstractions.

Focus: Ownership patterns, lifetimes, traits, async/await with tokio.

secure-code-guardian

Use when: Implementing authentication/authorization, securing user input, or preventing OWASP Top 10 vulnerabilities.

Focus: Authentication, authorization, input validation, encryption.

security-reviewer

Use when: Conducting security audits, reviewing code for vulnerabilities, or analyzing infrastructure security.

Focus: SAST scans, penetration testing, DevSecOps practices, cloud security reviews.

cost-optimizer

Use when: Analyzing cloud costs, optimizing resource spending, or planning reserved capacity.

Focus: AWS/GCP/Azure cost analysis, right-sizing, reserved instances, spot instances, cost allocation, FinOps practices.

sre-engineer

Use when: Defining SLIs/SLOs, managing error budgets, or building reliable systems at scale.

Focus: Incident management, chaos engineering, toil reduction, capacity planning.
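
As a worked example of error budget arithmetic, the sketch below converts an availability SLO into allowed downtime. The function names and the 30-day default window are assumptions for illustration; a 99.9% SLO over 30 days allows about 43.2 minutes of downtime.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for an availability SLO such as 0.999."""
    if not 0.0 < slo < 1.0:
        raise ValueError("slo must be strictly between 0 and 1")
    return (1.0 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, downtime_minutes: float, window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative once overspent)."""
    budget = error_budget_minutes(slo, window_days)
    return (budget - downtime_minutes) / budget
```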

terraform-engineer

Use when: Implementing infrastructure as code with Terraform across AWS, Azure, or GCP.

Focus: Module development, state management, provider configuration, multi-environment workflows.

terragrunt-expert

Use when: Orchestrating Terraform/OpenTofu modules with Terragrunt - DRY configurations, stack architecture, dependency management.

Core principle: Eliminate duplication across environments with Terragrunt's include blocks, dependency management, and remote state automation.

Features:

  • DRY configurations across environments
  • Stack architecture (implicit/explicit)
  • Dependency graph management with mock outputs
  • Remote state automation with backend configuration
  • Multi-environment deployment workflows

container-engineer

Use when: Building, optimizing, or securing container images and orchestration for production environments.

Core principle: Build lean, secure, and maintainable container images with multi-stage builds, security hardening, and supply chain security.

Features:

  • Multi-stage Dockerfile patterns
  • Image size optimization and layer caching
  • Security hardening (non-root, read-only filesystem, capabilities)
  • Supply chain security (SBOM, cosign, SLSA)
  • Docker Compose for orchestration
  • Kubernetes runtime (containerd, CRI-O)
  • Vulnerability scanning and remediation

network-engineer

Use when: Designing, optimizing, or troubleshooting cloud and hybrid network infrastructures.

Core principle: Design networks that are scalable, secure, and highly available with proper segmentation and zero-trust principles.

Features:

  • VPC architecture (single/multi-region)
  • Load balancing strategies (Layer 4/7, global, internal)
  • DNS management and failover routing
  • VPN, Direct Connect, ExpressRoute, Cloud Interconnect
  • Zero-trust network architecture
  • Network segmentation and security groups

platform-engineer

Use when: Building or improving internal developer platforms (IDPs), designing self-service infrastructure, or optimizing developer workflows.

Core principle: Treat the platform as a product with developers as customers - reduce cognitive load through self-service and golden paths.

Features:

  • Internal Developer Platforms (IDPs)
  • Self-service infrastructure capabilities
  • Golden path templates for services
  • Backstage developer portal implementation
  • Service catalogs and software templates
  • Platform metrics and adoption tracking

test-master

Use when: Writing tests, creating test strategies, or building automation frameworks.

Focus: Unit tests, integration tests, E2E, coverage analysis, performance testing, security testing.

Commands

Quick invoke skills using /command syntax:

SRE Operations:

  • /test-driven-operation - Execute operations with verification commands
  • /subagent-driven-operation - Execute operation plans with subagent dispatch
  • /brainstorming-operations - Design infrastructure operations
  • /writing-operation-plans - Create detailed execution plans
  • /sre-runbook - Create structured SRE runbooks

Workspace & Lifecycle:

  • /using-git-worktrees-sre - Create isolated workspaces for control repos
  • /finishing-operation-branch - Complete operations with merge/PR workflow

Incident Response:

  • /systematic-troubleshooting - 4-phase root cause analysis for incidents
  • /incident-commander - Coordinate major incident response with ICS structure
  • /post-mortem-writer - Create blameless post-mortems

Operations Enhancement:

  • /executing-operation-plans - Execute plans in separate sessions with checkpoints
  • /dispatching-parallel-agents-sre - Run 2+ independent infrastructure tasks in parallel
  • /observability-integration - Verify operations using metrics and alerting data (Prometheus, Datadog, CloudWatch, New Relic)
  • /verification-before-completion - Enforce evidence-before-claims before any completion status
  • /safety-validator - Review commands for high-risk operations
  • /progressive-delivery - Canary/blue-green release with SLO-based rollback triggers
  • /toil-analysis - Measure toil, plan automation investments, model capacity
  • /receiving-code-review-sre - Process code review feedback on infrastructure changes

CI/CD & Pipelines:

  • /gitlab-ecr-pipeline - GitLab CI/CD → AWS ECR pipelines

Architecture & Design:

  • /architecture-designer - System architecture design and review
  • /cloud-architect - Cloud architecture and multi-cloud optimization
  • /microservices-architect - Distributed systems and microservices patterns

DevOps & Infrastructure:

  • /devops-engineer - CI/CD pipelines, containers, infrastructure as code
  • /terraform-engineer - Infrastructure as code with Terraform
  • /terragrunt-expert - Terragrunt orchestration for Terraform/OpenTofu
  • /container-engineer - Container builds, optimization, and security
  • /network-engineer - Network infrastructure and architecture
  • /kubernetes-specialist - Kubernetes operations depth
  • /chaos-engineer - Resilience testing and failure injection
  • /platform-engineer - Internal Developer Platforms (IDPs)

Observability & Reliability:

  • /observability-engineer - Observability stack setup and management
  • /sre-engineer - SLO/SLI management and reliability at scale

Cost & Optimization:

  • /cost-optimizer - Cloud cost analysis and optimization

Languages & Development:

  • /golang-pro - Go application development
  • /python-pro - Python application development
  • /rust-engineer - Rust systems programming
  • /postgresql-engineer - PostgreSQL operations and SQL optimization

Security:

  • /secure-code-guardian - Application security and OWASP prevention
  • /security-reviewer - Security audits and infrastructure security

Quality & Documentation:

  • /code-reviewer - Code quality audits and PR reviews
  • /code-documenter - API documentation and docstrings
  • /test-master - Testing strategy and automation

Meta & Utilities:

  • /using-srepowers - Meta-skill: how to find and use SRE skills
  • /writing-skills-sre - Create or edit SRE infrastructure skills
  • /environment-health-check - Verify required tools are installed
  • /playground-tutorial - Safe, local tutorial for learning TDO

Companion Plugin: Superpowers

SREPowers is a companion plugin to superpowers. It adapts superpowers' software development workflows for SRE/infrastructure operations. Install both for complete coverage:

| Software Development (superpowers) | SRE Infrastructure (srepowers) |
|------------------------------------|--------------------------------|
| test-driven-development | test-driven-operation |
| subagent-driven-development | subagent-driven-operation |
| brainstorming | brainstorming-operations |
| writing-plans | writing-operation-plans |
| executing-plans | executing-operation-plans |
| using-git-worktrees | using-git-worktrees-sre |
| finishing-a-development-branch | finishing-operation-branch |
| systematic-debugging | systematic-troubleshooting |
| verification-before-completion | verification-before-completion (shared) |
| dispatching-parallel-agents | dispatching-parallel-agents-sre |
| receiving-code-review | receiving-code-review-sre |
| writing-skills | writing-skills-sre (extends upstream) |

The following are provided by superpowers only (no SREPowers equivalent):

  • requesting-code-review — pre-review checklist for code

SREPowers adds 30+ SRE-native skills with no superpowers equivalent (incident command, runbooks, PVE, Puppet, GitLab ECR, observability, progressive delivery, toil, cost, and domain expertise skills).

Developer Tools

Skill Generator

Create new skills with the scaffolding tool:

# Interactive mode
python scripts/create-skill.py

# With arguments
python scripts/create-skill.py \
  --name my-skill \
  --description "Use when doing X" \
  --category core

This generates:

  • skills/my-skill/SKILL.md - Skill definition
  • commands/my-skill.md - Command wrapper
  • skills/my-skill/references/ - Reference directory
  • tests/claude-code/test-my-skill.sh - Test template

Evaluation Framework

Run automated evaluations to verify skill output quality:

# Run all evals
python evals/eval-runner.py

# Run specific skill eval
python evals/eval-runner.py --skill sre-runbook

# Generate report
python evals/eval-runner.py --report results.md

Commands are thin wrappers that invoke skills directly for quick access.

Usage Examples

Kubernetes Deployment

# Write verification first
kubectl get deployment -n staging api-server -o jsonpath='{.spec.replicas}'

# Apply deployment
kubectl apply -f deployment.yaml

# Verify
kubectl get deployment -n staging api-server -o jsonpath='{.spec.replicas}'
# Output: 3

Keycloak Realm Provisioning

# Write verification first
kubectl get keycloakrealm/example-realm -o jsonpath='{.status.ready}'

# Apply Keycloak CRD
kubectl apply -f keycloak-realm.yaml

# Verify
kubectl get keycloakrealm/example-realm -o jsonpath='{.status.ready}'
# Output: true

Git Control Repo Operation

# Write verification first
kubectl get configmap -n production app-config -o jsonpath='{.data.DATABASE_URL}'

# Create config in control repo
cat > manifests/production/app-config.yaml <<EOF
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
  namespace: production
data:
  DATABASE_URL: postgresql://prod-db.example.com:5432/app
EOF

git add manifests/production/app-config.yaml
git commit -m "Add production database config"
git push

# Wait for ArgoCD/Flux sync, then verify
kubectl get configmap -n production app-config -o jsonpath='{.data.DATABASE_URL}'
# Output: postgresql://prod-db.example.com:5432/app

API Operation

# Write verification first
curl -s https://api.example.com/users/123 | jq '.email'
# Output: null

# Execute API call
curl -X PATCH https://api.example.com/users/123 \
  -H "Content-Type: application/json" \
  -d '{"email": "user@example.com"}'

# Verify
curl -s https://api.example.com/users/123 | jq '.email'
# Output: "user@example.com"

Key Principles

Test-Driven Operation (TDO)

  • Tests = Verification commands (kubectl, API calls, Git queries)
  • Commits = Git operations on control repo
  • Always write verification first, run it, watch it fail
  • Execute minimal operation to pass
  • Verify output matches expected result

Subagent-Driven Operation

  • Operator = Infrastructure operations specialist
  • Artifact quality review = YAML/JSON validity, Kubernetes best practices
  • Tests = Verification commands
  • Commits = Git operations on control repo
  • Adaptive patterns = Inline (<=2 tasks), Segmented (3-6), Full (7+ or high risk)
  • Deviation taxonomy = R1-R4 (auto-fix through STOP) with retry limits
  • Execution state = Per-task tracking in plan file for resume after interruption

Two-Stage Review

  1. Spec Compliance - Verified all operations executed, nothing missing/extra
  2. Artifact Quality - YAML/JSON valid, proper labels/annotations, security best practices

Documentation

Contributing

Contributions are welcome! Repository: github.com/yg-codes/srepowers

Please:

  1. Fork the repository
  2. Create a feature branch (cu_your_feature)
  3. Follow the skill format (SKILL.md with frontmatter)
  4. Test your skills thoroughly
  5. Submit a pull request

For bug reports and feature requests, open an issue.

License

MIT License - see LICENSE for details.

Acknowledgments

Adapted from the excellent superpowers plugin by Jesse Vincent, tailored for SRE infrastructure workflows.

Version History

See git log for version history and changes.
