diff --git a/docs/metrics.md b/docs/metrics.md new file mode 100644 index 0000000..c236a73 --- /dev/null +++ b/docs/metrics.md @@ -0,0 +1,278 @@ +# Metrics Documentation + +This document describes all Prometheus metrics exposed by HyperFleet API, including their meanings, expected ranges, and example queries for common investigations. + +## Metrics Endpoint + +Metrics are exposed at: +- **Endpoint**: `/metrics` +- **Port**: 9090 (default, configurable via `--metrics-server-bindaddress`) +- **Format**: OpenMetrics/Prometheus text format + +## Application Metrics + +### API Request Metrics + +These metrics track all inbound HTTP requests to the API server. + +#### `api_inbound_request_count` + +**Type:** Counter + +**Description:** Total number of HTTP requests served by the API. + +**Labels:** + +| Label | Description | Example Values | +|-------|-------------|----------------| +| `method` | HTTP method | `GET`, `POST`, `PUT`, `PATCH`, `DELETE` | +| `path` | Request path (with IDs replaced by `-`) | `/api/hyperfleet/v1/clusters/-` | +| `code` | HTTP response status code | `200`, `201`, `400`, `404`, `500` | + +**Path normalization:** Object identifiers in paths are replaced with `-` to reduce cardinality. For example, `/api/hyperfleet/v1/clusters/abc123` becomes `/api/hyperfleet/v1/clusters/-`. + +**Example output:** +```text +api_inbound_request_count{code="200",method="GET",path="/api/hyperfleet/v1/clusters"} 1523 +api_inbound_request_count{code="200",method="GET",path="/api/hyperfleet/v1/clusters/-"} 8742 +api_inbound_request_count{code="201",method="POST",path="/api/hyperfleet/v1/clusters"} 156 +api_inbound_request_count{code="404",method="GET",path="/api/hyperfleet/v1/clusters/-"} 23 +``` + +#### `api_inbound_request_duration` + +**Type:** Histogram + +**Description:** Distribution of request processing times in seconds. + +**Labels:** Same as `api_inbound_request_count` + +**Buckets:** `0.1s`, `1s`, `10s`, `30s` + +**Derived metrics:** +- `api_inbound_request_duration_sum` - Total time spent processing requests +- `api_inbound_request_duration_count` - Number of requests measured +- `api_inbound_request_duration_bucket` - Number of requests completed within each bucket + +**Example output:** +```text +api_inbound_request_duration_bucket{code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="0.1"} 1450 +api_inbound_request_duration_bucket{code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="1"} 1520 +api_inbound_request_duration_bucket{code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="10"} 1523 +api_inbound_request_duration_bucket{code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="30"} 1523 +api_inbound_request_duration_bucket{code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="+Inf"} 1523 +api_inbound_request_duration_sum{code="200",method="GET",path="/api/hyperfleet/v1/clusters"} 45.23 +api_inbound_request_duration_count{code="200",method="GET",path="/api/hyperfleet/v1/clusters"} 1523 +``` + +## Go Runtime Metrics + +The following metrics are automatically exposed by the Prometheus Go client library. 
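These collectors ship with the Prometheus Go client library and are registered on its default registry, so serving the standard `/metrics` handler is enough to surface them. A minimal sketch of that wiring, assuming the stock `promhttp` handler (the actual HyperFleet API setup may differ):

```go
package main

import (
	"log"
	"net/http"

	"github.com/prometheus/client_golang/prometheus/promhttp"
)

func main() {
	// promhttp.Handler() serves the default registry, which already includes
	// the Go runtime and process collectors, so the go_* and process_*
	// metrics listed below appear without any explicit registration.
	http.Handle("/metrics", promhttp.Handler())

	// Port 9090 matches the default metrics bind address described above.
	log.Fatal(http.ListenAndServe(":9090", nil))
}
```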
+ +### Process Metrics + +| Metric | Type | Description | +|--------|------|-------------| +| `process_cpu_seconds_total` | Counter | Total user and system CPU time spent in seconds | +| `process_max_fds` | Gauge | Maximum number of open file descriptors | +| `process_open_fds` | Gauge | Number of open file descriptors | +| `process_resident_memory_bytes` | Gauge | Resident memory size in bytes | +| `process_start_time_seconds` | Gauge | Start time of the process since unix epoch | +| `process_virtual_memory_bytes` | Gauge | Virtual memory size in bytes | + +### Go Runtime Metrics + +| Metric | Type | Description | +|--------|------|-------------| +| `go_gc_duration_seconds` | Summary | A summary of pause durations during GC cycles | +| `go_goroutines` | Gauge | Number of goroutines currently existing | +| `go_memstats_alloc_bytes` | Gauge | Bytes allocated and still in use | +| `go_memstats_alloc_bytes_total` | Counter | Total bytes allocated (even if freed) | +| `go_memstats_heap_alloc_bytes` | Gauge | Heap bytes allocated and still in use | +| `go_memstats_heap_idle_bytes` | Gauge | Heap bytes waiting to be used | +| `go_memstats_heap_inuse_bytes` | Gauge | Heap bytes in use | +| `go_memstats_heap_objects` | Gauge | Number of allocated objects | +| `go_memstats_heap_sys_bytes` | Gauge | Heap bytes obtained from system | +| `go_memstats_sys_bytes` | Gauge | Total bytes obtained from system | +| `go_threads` | Gauge | Number of OS threads created | + +## Expected Ranges and Alerting Thresholds + +### Request Rate + +| Condition | Threshold | Severity | Description | +|-----------|-----------|----------|-------------| +| Normal | < 1000 req/s | - | Normal operating range | +| Warning | > 1000 req/s | Warning | High load, monitor closely | +| Critical | > 5000 req/s | Critical | Capacity limit approaching | + +### Error Rate + +| Condition | Threshold | Severity | Description | +|-----------|-----------|----------|-------------| +| Normal | < 1% | - | Normal error rate | +| Warning | 1-5% | Warning | Elevated errors, investigate | +| Critical | > 5% | Critical | High error rate, immediate action | + +### Latency (P99) + +| Condition | Threshold | Severity | Description | +|-----------|-----------|----------|-------------| +| Normal | < 500ms | - | Good response times | +| Warning | 500ms - 2s | Warning | Degraded performance | +| Critical | > 2s | Critical | Unacceptable latency | + +### Memory Usage + +| Condition | Threshold | Severity | Description | +|-----------|-----------|----------|-------------| +| Normal | < 70% of limit | - | Healthy memory usage | +| Warning | 70-85% of limit | Warning | Memory pressure | +| Critical | > 85% of limit | Critical | OOM risk | + +### Goroutines + +| Condition | Threshold | Severity | Description | +|-----------|-----------|----------|-------------| +| Normal | < 1000 | - | Normal goroutine count | +| Warning | 1000-5000 | Warning | High goroutine count | +| Critical | > 5000 | Critical | Possible goroutine leak | + +## Example PromQL Queries + +### Request Rate + +```promql +# Total request rate (requests per second) +sum(rate(api_inbound_request_count[5m])) + +# Request rate by pod/instance +sum(rate(api_inbound_request_count[5m])) by (instance) + +# Request rate by endpoint +sum(rate(api_inbound_request_count[5m])) by (path) + +# Request rate by status code +sum(rate(api_inbound_request_count[5m])) by (code) + +# Request rate by method +sum(rate(api_inbound_request_count[5m])) by (method) +``` + +### Error Rate + +```promql +# Overall error rate 
(5xx responses) +sum(rate(api_inbound_request_count{code=~"5.."}[5m])) / +sum(rate(api_inbound_request_count[5m])) * 100 + +# Error rate by endpoint +sum(rate(api_inbound_request_count{code=~"5.."}[5m])) by (path) / +sum(rate(api_inbound_request_count[5m])) by (path) * 100 + +# Client error rate (4xx responses) +sum(rate(api_inbound_request_count{code=~"4.."}[5m])) / +sum(rate(api_inbound_request_count[5m])) * 100 +``` + +### Latency + +```promql +# Average request duration (last 10 minutes) +rate(api_inbound_request_duration_sum[10m]) / +rate(api_inbound_request_duration_count[10m]) + +# Average request duration by endpoint +sum(rate(api_inbound_request_duration_sum[5m])) by (path) / +sum(rate(api_inbound_request_duration_count[5m])) by (path) + +# P50 latency (approximate using histogram) +histogram_quantile(0.5, sum(rate(api_inbound_request_duration_bucket[5m])) by (le)) + +# P90 latency +histogram_quantile(0.9, sum(rate(api_inbound_request_duration_bucket[5m])) by (le)) + +# P99 latency +histogram_quantile(0.99, sum(rate(api_inbound_request_duration_bucket[5m])) by (le)) + +# P99 latency by endpoint +histogram_quantile(0.99, sum(rate(api_inbound_request_duration_bucket[5m])) by (le, path)) +``` + +### Resource Usage + +```promql +# Memory usage in MB +process_resident_memory_bytes / 1024 / 1024 + +# Memory usage trend (increase over 1 hour) +delta(process_resident_memory_bytes[1h]) / 1024 / 1024 + +# Goroutine count +go_goroutines + +# Goroutine trend +delta(go_goroutines[1h]) + +# CPU usage rate +rate(process_cpu_seconds_total[5m]) + +# File descriptor usage percentage +process_open_fds / process_max_fds * 100 +``` + +### Common Investigation Queries + +```promql +# Slowest endpoints (average latency) +topk(10, + sum(rate(api_inbound_request_duration_sum[5m])) by (path) / + sum(rate(api_inbound_request_duration_count[5m])) by (path) +) + +# Most requested endpoints +topk(10, sum(rate(api_inbound_request_count[5m])) by (path)) + +# Endpoints with highest error rate +topk(10, + sum(rate(api_inbound_request_count{code=~"5.."}[5m])) by (path) / + sum(rate(api_inbound_request_count[5m])) by (path) +) + +# Percentage of requests taking longer than 1 second +1 - (sum(rate(api_inbound_request_duration_bucket{le="1"}[5m])) / +sum(rate(api_inbound_request_duration_count[5m]))) +``` + +## Prometheus Operator Integration + +If using Prometheus Operator, enable the ServiceMonitor in Helm values: + +```yaml +serviceMonitor: + enabled: true + interval: 30s + scrapeTimeout: 10s + labels: + release: prometheus # Match your Prometheus selector +``` + +See [Deployment Guide](deployment.md#prometheus-operator-integration) for details. + +## Grafana Dashboard + +Example dashboard JSON for HyperFleet API monitoring is available in the architecture repository. Key panels to include: + +1. **Request Rate** - Total requests per second over time +2. **Error Rate** - Percentage of 5xx responses +3. **Latency Distribution** - P50, P90, P99 latencies +4. **Request Duration Heatmap** - Visual distribution of request times +5. **Top Endpoints** - Most frequently accessed paths +6. **Memory Usage** - Resident memory over time +7. 
**Goroutines** - Goroutine count over time + +## Related Documentation + +- [Operational Runbook](runbook.md) - Troubleshooting and operational procedures +- [Deployment Guide](deployment.md) - Deployment and ServiceMonitor configuration +- [Development Guide](development.md) - Local development setup diff --git a/docs/runbook.md b/docs/runbook.md new file mode 100644 index 0000000..4b5f2b4 --- /dev/null +++ b/docs/runbook.md @@ -0,0 +1,402 @@ +# Operational Runbook + +This runbook provides operational procedures for managing HyperFleet API in production environments. + +## Service Overview + +HyperFleet API is a REST service that manages HyperFleet cluster and nodepool resources. It exposes: + +- **API Server**: Port 8000 - REST API endpoints +- **Health Server**: Port 8080 - Liveness (`/healthz`) and readiness (`/readyz`) probes +- **Metrics Server**: Port 9090 - Prometheus metrics (`/metrics`) + +### Architecture Diagram + +```text + ┌─────────────────────────────────────┐ + │ hyperfleet-api Pod │ + │ │ + ┌─────────────┐ │ ┌─────────────────────────────┐ │ + │ Clients │──────────────┼─▶│ API Server (:8000) │ │ + │ │ REST API │ │ /api/hyperfleet/v1/* │ │ + └─────────────┘ │ └──────────────┬──────────────┘ │ + │ │ │ + ┌─────────────┐ │ ┌──────────────▼──────────────┐ │ + │ Kubernetes │──────────────┼─▶│ Health Server (:8080) │ │ + │ Probes │ HTTP GET │ │ /healthz /readyz │ │ + └─────────────┘ │ └──────────────┬──────────────┘ │ + │ │ │ + ┌─────────────┐ │ ┌──────────────▼──────────────┐ │ + │ Prometheus │──────────────┼─▶│ Metrics Server (:9090) │ │ + │ │ Scrape │ │ /metrics │ │ + └─────────────┘ │ └─────────────────────────────┘ │ + │ │ │ + └─────────────────┼───────────────────┘ + │ + ▼ + ┌─────────────────────────────────────┐ + │ PostgreSQL │ + │ (clusters, nodepools) │ + └─────────────────────────────────────┘ +``` + +## Health Check Interpretation + +### Liveness Probe (`/healthz`) + +The liveness probe indicates whether the application process is alive and responsive. + +| Response | Status | Meaning | +|----------|--------|---------| +| `200 OK` | `{"status": "ok"}` | Process is alive and responsive | + +**Note:** The liveness probe always returns 200 OK if the HTTP server is responding. If the process crashes or hangs, Kubernetes will not receive a response and will restart the pod. + +**When liveness probe times out or connection fails:** +- The pod will be restarted by Kubernetes +- Check logs for fatal errors or panics +- This should be rare; frequent restarts indicate a serious issue + +### Readiness Probe (`/readyz`) + +The readiness probe indicates whether the application is ready to receive traffic. 
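Both probes are typically pointed at the health server on port 8080 in the Deployment's pod spec. A hedged sketch of that configuration (the timings and thresholds here are illustrative, not the shipped Helm chart defaults):

```yaml
# Illustrative probe configuration; adjust to match the actual chart values.
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 10
  periodSeconds: 10
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  failureThreshold: 3
```

The table below lists the responses the readiness endpoint can return.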
+ +| Response | Status | Meaning | +|----------|--------|---------| +| `200 OK` | `{"status": "ok"}` | Ready to receive traffic | +| `503 Service Unavailable` | `{"status": "not_ready"}` | Still initializing or dependencies unavailable | +| `503 Service Unavailable` | `{"status": "shutting_down"}` | Graceful shutdown in progress | + +**Readiness checks include:** +- Application initialization complete +- Database connection available and responding to pings +- Not in shutdown state + +**When readiness fails:** +- Pod is removed from service endpoints (no traffic routed) +- Rolling updates will not promote new pods until they become ready +- Check database connectivity first +- Verify all required environment variables are set +- Check startup logs for initialization errors + +## Common Operational Procedures + +### Restarting the Service + +#### Single Pod Restart + +```bash +# Delete a specific pod (Kubernetes will recreate it) +kubectl delete pod -n hyperfleet-system + +# Or rollout restart the entire deployment +kubectl rollout restart deployment/hyperfleet-api -n hyperfleet-system +``` + +#### Verify Restart Success + +```bash +# Watch pods come up +kubectl get pods -n hyperfleet-system -w + +# Check readiness +kubectl get pods -n hyperfleet-system -o jsonpath='{.items[*].status.conditions[?(@.type=="Ready")].status}' + +# Verify health endpoints +kubectl port-forward svc/hyperfleet-api-health 8080:8080 -n hyperfleet-system & +curl http://localhost:8080/healthz +curl http://localhost:8080/readyz +``` + +### Scaling the Service + +#### Manual Scaling + +```bash +# Scale up +kubectl scale deployment/hyperfleet-api --replicas=5 -n hyperfleet-system + +# Scale down +kubectl scale deployment/hyperfleet-api --replicas=2 -n hyperfleet-system +``` + +#### Verify Scaling + +```bash +# Check replica count +kubectl get deployment hyperfleet-api -n hyperfleet-system + +# Verify all pods are ready +kubectl get pods -n hyperfleet-system -l app=hyperfleet-api +``` + +### Database Operations + +#### Check Database Connectivity + +```bash +# Check readiness probe (includes DB connectivity check) +kubectl port-forward svc/hyperfleet-api-health 8080:8080 -n hyperfleet-system & +curl http://localhost:8080/readyz + +# If readiness returns "Database ping failed", use a debug pod to test connectivity +kubectl run pg-debug --rm -it --image=postgres:15-alpine --restart=Never -n hyperfleet-system -- \ + pg_isready -h -p +``` + +#### Database Connection Pool Issues + +If you see `connection refused` or `too many connections` errors: + +1. Check current connection count on database +2. Verify `--db-max-open-connections` setting (default: 50) +3. Consider scaling down replicas to reduce connection load +4. Check for connection leaks in recent deployments + +#### Database Migrations + +Migrations run automatically via an init container (`db-migrate`) before the main application starts. This happens on every deployment. 
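In the Deployment, this corresponds to an init container roughly like the fragment below (an illustrative sketch based on the image, command, and secret mount used elsewhere in this runbook; the actual Helm template may differ):

```yaml
# Illustrative initContainer fragment; the shipped Helm template may differ.
spec:
  template:
    spec:
      initContainers:
        - name: db-migrate
          image: quay.io/openshift-hyperfleet/hyperfleet-api:latest
          command: ["/app/hyperfleet-api", "migrate"]
          volumeMounts:
            - name: secrets
              mountPath: /build/secrets
              readOnly: true
      containers:
        - name: hyperfleet-api
          image: quay.io/openshift-hyperfleet/hyperfleet-api:latest
          # ports, probes, and resources omitted for brevity
      volumes:
        - name: secrets
          secret:
            secretName: hyperfleet-db-external
```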
+ +To manually run migrations (rarely needed): + +```bash +# Run a one-off migration job +kubectl run hyperfleet-migrate --rm -it \ + --image=quay.io/openshift-hyperfleet/hyperfleet-api:latest \ + --restart=Never \ + -n hyperfleet-system \ + --overrides='{"spec":{"containers":[{"name":"hyperfleet-migrate","image":"quay.io/openshift-hyperfleet/hyperfleet-api:latest","command":["/app/hyperfleet-api","migrate"],"volumeMounts":[{"name":"secrets","mountPath":"/build/secrets","readOnly":true}]}],"volumes":[{"name":"secrets","secret":{"secretName":"hyperfleet-db-external"}}]}}' \ + -- /app/hyperfleet-api migrate +``` + +Or trigger a rollout restart to re-run the init container: + +```bash +kubectl rollout restart deployment/hyperfleet-api -n hyperfleet-system +``` + +### Log Analysis + +#### View Real-time Logs + +```bash +# Single pod +kubectl logs -f deployment/hyperfleet-api -n hyperfleet-system + +# All pods +kubectl logs -f -l app=hyperfleet-api -n hyperfleet-system --max-log-requests=10 +``` + +#### Search for Errors + +```bash +# Recent errors +kubectl logs deployment/hyperfleet-api -n hyperfleet-system --since=1h | grep -i error + +# Structured log query (if using JSON logs) +kubectl logs deployment/hyperfleet-api -n hyperfleet-system --since=1h | jq 'select(.level == "error")' +``` + +## Troubleshooting Guide + +### Pod Not Starting + +**Symptoms:** Pod stuck in `Pending`, `ContainerCreating`, or `CrashLoopBackOff` + +**Diagnosis:** +```bash +kubectl describe pod -n hyperfleet-system +kubectl get events -n hyperfleet-system --sort-by='.lastTimestamp' | tail -20 +``` + +**Common causes:** +- **ImagePullBackOff**: Check image name, tag, and registry credentials +- **Insufficient resources**: Check node capacity and resource requests +- **ConfigMap/Secret not found**: Verify all required configs exist + +### Pod Crashing on Startup + +**Symptoms:** `CrashLoopBackOff` status, restarts > 0 + +**Diagnosis:** +```bash +# Check previous container logs +kubectl logs -n hyperfleet-system --previous + +# Check events +kubectl describe pod -n hyperfleet-system +``` + +**Common causes:** +- Missing or invalid environment variables +- Database connection failure +- Invalid configuration file +- Port already in use (unlikely in Kubernetes) + +### High Latency + +**Symptoms:** Slow API responses, timeouts + +**Diagnosis:** +```bash +# Check request duration metrics +curl -s http://:9090/metrics | grep api_inbound_request_duration + +# Check pod resource usage +kubectl top pods -n hyperfleet-system +``` + +**Common causes:** +- Database query performance issues +- Insufficient CPU/memory resources +- Network latency to database +- High concurrent request load + +### High Error Rate + +**Symptoms:** Increased 5xx responses, error logs + +**Diagnosis:** +```bash +# Check error count by path and code +curl -s http://:9090/metrics | grep api_inbound_request_count + +# Review error logs +kubectl logs deployment/hyperfleet-api -n hyperfleet-system --since=15m | grep -i error +``` + +**Common causes:** +- Database connection issues +- Invalid request data +- Upstream service failures +- Resource exhaustion + +### Database Connection Errors + +**Symptoms:** `connection refused`, `no such host`, `connection reset` + +**Diagnosis:** +```bash +# Check readiness probe (includes DB check) +kubectl port-forward svc/hyperfleet-api-health 8080:8080 -n hyperfleet-system & +curl http://localhost:8080/readyz + +# Test connectivity using a debug pod +kubectl run pg-debug --rm -it --image=postgres:15-alpine --restart=Never 
-n hyperfleet-system -- \ + pg_isready -h -p + +# Check database secret exists and has expected keys (does not print values) +kubectl get secret hyperfleet-db -n hyperfleet-system -o go-template='{{range $k,$v := .data}}{{println $k}}{{end}}' +``` + +**Resolution:** +1. Verify database host and port are correct +2. Check network policies allow egress to database +3. Verify database credentials are valid +4. Check database is running and accepting connections +5. Verify SSL settings match database requirements + +### Memory Issues + +**Symptoms:** OOMKilled, high memory usage + +**Diagnosis:** +```bash +# Check memory usage +kubectl top pods -n hyperfleet-system + +# Check for OOMKilled events +kubectl get events -n hyperfleet-system | grep -i oom +``` + +**Resolution:** +1. Increase memory limits in deployment +2. Check for memory leaks (increasing memory over time) +3. Review query patterns that may load large datasets + +## Recovery Procedures + +### Complete Service Recovery + +If the service is completely down: + +1. **Check namespace exists:** + ```bash + kubectl get namespace hyperfleet-system + ``` + +2. **Check deployment exists:** + ```bash + kubectl get deployment hyperfleet-api -n hyperfleet-system + ``` + +3. **Force recreate all pods:** + ```bash + kubectl rollout restart deployment/hyperfleet-api -n hyperfleet-system + ``` + +4. **Verify recovery:** + ```bash + kubectl rollout status deployment/hyperfleet-api -n hyperfleet-system + ``` + +### Database Recovery + +If database is unavailable: + +1. **Verify database status** (external DB or PostgreSQL pod) +2. **Check connectivity** from API pods +3. **If using built-in PostgreSQL:** + ```bash + kubectl rollout restart statefulset/hyperfleet-postgresql -n hyperfleet-system + ``` +4. **Wait for readiness probes to pass** before routing traffic + +### Rollback to Previous Version + +```bash +# View rollout history +kubectl rollout history deployment/hyperfleet-api -n hyperfleet-system + +# Rollback to previous version +kubectl rollout undo deployment/hyperfleet-api -n hyperfleet-system + +# Rollback to specific revision +kubectl rollout undo deployment/hyperfleet-api -n hyperfleet-system --to-revision=2 +``` + +## Escalation Paths + +### Severity Levels + +| Level | Description | Response Time | Example | +|-------|-------------|---------------|---------| +| **P1 - Critical** | Complete service outage | Immediate | All pods crashing, database unavailable | +| **P2 - High** | Degraded service | 30 minutes | High error rate, significant latency | +| **P3 - Medium** | Minor impact | 4 hours | Single pod issues, non-critical errors | +| **P4 - Low** | No user impact | Next business day | Log noise, documentation issues | + +### Escalation Contacts + +For all HyperFleet issues, escalate via the team Slack channel: + +- **Channel**: [#hcm-hyperfleet-team](https://redhat.enterprise.slack.com/archives/C0916E39DQV) + +### When to Escalate + +- **Escalate immediately** if: + - Complete service outage affecting users + - Data integrity issues suspected + - Security incident detected + - Unable to diagnose issue within 30 minutes + +- **Escalate within 1 hour** if: + - Partial outage or degraded performance + - Issue requires access you don't have + - Root cause is unclear after initial investigation + +## Related Documentation + +- [Deployment Guide](deployment.md) - Deployment and configuration +- [Metrics Documentation](metrics.md) - Prometheus metrics reference +- [Development Guide](development.md) - Local development setup