HYPERFLEET-580 - docs: Add operational runbook and metrics documentation #45

Merged: openshift-merge-bot merged 6 commits into openshift-hyperfleet:main from rafabene:HYPERFLEET-580 on Feb 5, 2026.

Changes from all commits (6 commits):

- e2397d6 (rafabene): HYPERFLEET-580 - docs: Add operational runbook and metrics documentation
- 6c60e83 (rafabene): HYPERFLEET-580 - fix: Address review feedback
- 8ca31e2 (rafabene): HYPERFLEET-580 - fix: Address additional review feedback
- 3294624 (rafabene): HYPERFLEET-580 - fix: Correct PromQL query for requests > 1 second
- fe88abf (rafabene): HYPERFLEET-580 - fix: Add rolling update behavior to readiness docs
- 7e4d45d (rafabene): HYPERFLEET-580 - fix: Avoid exposing secret values in runbook

# Metrics Documentation

This document describes all Prometheus metrics exposed by the HyperFleet API, including their meanings, expected ranges, and example queries for common investigations.

## Metrics Endpoint

Metrics are exposed at:
- **Endpoint**: `/metrics`
- **Port**: 9090 (default, configurable via `--metrics-server-bindaddress`)
- **Format**: OpenMetrics/Prometheus text format
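
Once Prometheus is scraping this endpoint, the standard per-target series it generates can confirm the scrape is healthy. A minimal sketch, where the `job` label value is an assumption that depends on your scrape configuration or ServiceMonitor:

```promql
# 1 if the last scrape of the HyperFleet API target succeeded, 0 otherwise
# (the job label value is illustrative; match it to your scrape configuration)
up{job="hyperfleet-api"}

# Time each scrape of the target takes, in seconds
scrape_duration_seconds{job="hyperfleet-api"}
```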

## Application Metrics

### API Request Metrics

These metrics track all inbound HTTP requests to the API server.

#### `api_inbound_request_count`

**Type:** Counter

**Description:** Total number of HTTP requests served by the API.

**Labels:**

| Label | Description | Example Values |
|-------|-------------|----------------|
| `method` | HTTP method | `GET`, `POST`, `PUT`, `PATCH`, `DELETE` |
| `path` | Request path (with IDs replaced by `-`) | `/api/hyperfleet/v1/clusters/-` |
| `code` | HTTP response status code | `200`, `201`, `400`, `404`, `500` |

**Path normalization:** Object identifiers in paths are replaced with `-` to reduce cardinality. For example, `/api/hyperfleet/v1/clusters/abc123` becomes `/api/hyperfleet/v1/clusters/-`.

**Example output:**
```text
api_inbound_request_count{code="200",method="GET",path="/api/hyperfleet/v1/clusters"} 1523
api_inbound_request_count{code="200",method="GET",path="/api/hyperfleet/v1/clusters/-"} 8742
api_inbound_request_count{code="201",method="POST",path="/api/hyperfleet/v1/clusters"} 156
api_inbound_request_count{code="404",method="GET",path="/api/hyperfleet/v1/clusters/-"} 23
```
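
Because object identifiers are collapsed into the normalized path, queries for a single-object endpoint filter on the `-` placeholder rather than a concrete ID. For example, the rate of cluster-detail requests broken down by method and status code:

```promql
# Request rate against the normalized cluster-detail path
sum(rate(api_inbound_request_count{path="/api/hyperfleet/v1/clusters/-"}[5m])) by (method, code)
```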

#### `api_inbound_request_duration`

**Type:** Histogram

**Description:** Distribution of request processing times in seconds.

**Labels:** Same as `api_inbound_request_count`

**Buckets:** `0.1s`, `1s`, `10s`, `30s`

**Derived metrics:**
- `api_inbound_request_duration_sum` - Total time spent processing requests
- `api_inbound_request_duration_count` - Number of requests measured
- `api_inbound_request_duration_bucket` - Cumulative number of requests completed within each bucket's upper bound (`le`)

**Example output:**
```text
api_inbound_request_duration_bucket{code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="0.1"} 1450
api_inbound_request_duration_bucket{code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="1"} 1520
api_inbound_request_duration_bucket{code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="10"} 1523
api_inbound_request_duration_bucket{code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="30"} 1523
api_inbound_request_duration_bucket{code="200",method="GET",path="/api/hyperfleet/v1/clusters",le="+Inf"} 1523
api_inbound_request_duration_sum{code="200",method="GET",path="/api/hyperfleet/v1/clusters"} 45.23
api_inbound_request_duration_count{code="200",method="GET",path="/api/hyperfleet/v1/clusters"} 1523
```
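
Because the buckets are cumulative, dividing a bucket's rate by the total count gives the share of requests that completed within that bucket's bound. A small sketch for the 0.1s bucket:

```promql
# Fraction of requests completing within 0.1 seconds
sum(rate(api_inbound_request_duration_bucket{le="0.1"}[5m]))
/
sum(rate(api_inbound_request_duration_count[5m]))
```

Note that `histogram_quantile` can only interpolate within these four bucket boundaries, so computed percentiles are coarse, especially between the 1s and 10s bounds.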

## Go Runtime Metrics

The following metrics are automatically exposed by the Prometheus Go client library.

### Process Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `process_cpu_seconds_total` | Counter | Total user and system CPU time spent, in seconds |
| `process_max_fds` | Gauge | Maximum number of open file descriptors |
| `process_open_fds` | Gauge | Number of open file descriptors |
| `process_resident_memory_bytes` | Gauge | Resident memory size in bytes |
| `process_start_time_seconds` | Gauge | Start time of the process since the Unix epoch, in seconds |
| `process_virtual_memory_bytes` | Gauge | Virtual memory size in bytes |

### Go Runtime Metrics

| Metric | Type | Description |
|--------|------|-------------|
| `go_gc_duration_seconds` | Summary | A summary of pause durations during GC cycles |
| `go_goroutines` | Gauge | Number of goroutines that currently exist |
| `go_memstats_alloc_bytes` | Gauge | Bytes allocated and still in use |
| `go_memstats_alloc_bytes_total` | Counter | Total bytes allocated, even if freed |
| `go_memstats_heap_alloc_bytes` | Gauge | Heap bytes allocated and still in use |
| `go_memstats_heap_idle_bytes` | Gauge | Heap bytes waiting to be used |
| `go_memstats_heap_inuse_bytes` | Gauge | Heap bytes in use |
| `go_memstats_heap_objects` | Gauge | Number of allocated objects |
| `go_memstats_heap_sys_bytes` | Gauge | Heap bytes obtained from the system |
| `go_memstats_sys_bytes` | Gauge | Total bytes obtained from the system |
| `go_threads` | Gauge | Number of OS threads created |
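
Two queries that complement the tables above; a brief sketch, assuming the default quantile labels (`0`, `0.25`, `0.5`, `0.75`, `1`) exported by the Go client for `go_gc_duration_seconds`:

```promql
# 75th percentile GC pause duration, in seconds
go_gc_duration_seconds{quantile="0.75"}

# Heap currently in use, in MB
go_memstats_heap_inuse_bytes / 1024 / 1024
```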

## Expected Ranges and Alerting Thresholds

### Request Rate

| Condition | Threshold | Severity | Description |
|-----------|-----------|----------|-------------|
| Normal | < 1000 req/s | - | Normal operating range |
| Warning | > 1000 req/s | Warning | High load, monitor closely |
| Critical | > 5000 req/s | Critical | Capacity limit approaching |

### Error Rate

| Condition | Threshold | Severity | Description |
|-----------|-----------|----------|-------------|
| Normal | < 1% | - | Normal error rate |
| Warning | 1-5% | Warning | Elevated errors, investigate |
| Critical | > 5% | Critical | High error rate, immediate action |

### Latency (P99)

| Condition | Threshold | Severity | Description |
|-----------|-----------|----------|-------------|
| Normal | < 500ms | - | Good response times |
| Warning | 500ms - 2s | Warning | Degraded performance |
| Critical | > 2s | Critical | Unacceptable latency |

### Memory Usage

| Condition | Threshold | Severity | Description |
|-----------|-----------|----------|-------------|
| Normal | < 70% of limit | - | Healthy memory usage |
| Warning | 70-85% of limit | Warning | Memory pressure |
| Critical | > 85% of limit | Critical | OOM risk |

### Goroutines

| Condition | Threshold | Severity | Description |
|-----------|-----------|----------|-------------|
| Normal | < 1000 | - | Normal goroutine count |
| Warning | 1000-5000 | Warning | High goroutine count |
| Critical | > 5000 | Critical | Possible goroutine leak |
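
The critical thresholds above translate directly into PromQL alert conditions. A minimal sketch (the memory threshold is omitted because it needs container limit metrics not covered in this document); tune the rate windows and values to your environment:

```promql
# Critical: request rate above 5000 req/s
sum(rate(api_inbound_request_count[5m])) > 5000

# Critical: 5xx error rate above 5%
(
  sum(rate(api_inbound_request_count{code=~"5.."}[5m]))
  /
  sum(rate(api_inbound_request_count[5m]))
) * 100 > 5

# Critical: P99 latency above 2 seconds
histogram_quantile(0.99, sum(rate(api_inbound_request_duration_bucket[5m])) by (le)) > 2

# Critical: possible goroutine leak
go_goroutines > 5000
```

Each expression returns a result only while its condition holds, which is what a Prometheus alerting rule expects in its `expr` field.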

## Example PromQL Queries

### Request Rate

```promql
# Total request rate (requests per second)
sum(rate(api_inbound_request_count[5m]))

# Request rate by pod/instance
sum(rate(api_inbound_request_count[5m])) by (instance)

# Request rate by endpoint
sum(rate(api_inbound_request_count[5m])) by (path)

# Request rate by status code
sum(rate(api_inbound_request_count[5m])) by (code)

# Request rate by method
sum(rate(api_inbound_request_count[5m])) by (method)
```

### Error Rate

```promql
# Overall error rate (5xx responses)
sum(rate(api_inbound_request_count{code=~"5.."}[5m])) /
sum(rate(api_inbound_request_count[5m])) * 100

# Error rate by endpoint
sum(rate(api_inbound_request_count{code=~"5.."}[5m])) by (path) /
sum(rate(api_inbound_request_count[5m])) by (path) * 100

# Client error rate (4xx responses)
sum(rate(api_inbound_request_count{code=~"4.."}[5m])) /
sum(rate(api_inbound_request_count[5m])) * 100
```

### Latency

```promql
# Average request duration (last 10 minutes)
rate(api_inbound_request_duration_sum[10m]) /
rate(api_inbound_request_duration_count[10m])

# Average request duration by endpoint
sum(rate(api_inbound_request_duration_sum[5m])) by (path) /
sum(rate(api_inbound_request_duration_count[5m])) by (path)

# P50 latency (approximate, using the histogram)
histogram_quantile(0.5, sum(rate(api_inbound_request_duration_bucket[5m])) by (le))

# P90 latency
histogram_quantile(0.9, sum(rate(api_inbound_request_duration_bucket[5m])) by (le))

# P99 latency
histogram_quantile(0.99, sum(rate(api_inbound_request_duration_bucket[5m])) by (le))

# P99 latency by endpoint
histogram_quantile(0.99, sum(rate(api_inbound_request_duration_bucket[5m])) by (le, path))
```

### Resource Usage

```promql
# Memory usage in MB
process_resident_memory_bytes / 1024 / 1024

# Memory usage trend (increase over 1 hour)
delta(process_resident_memory_bytes[1h]) / 1024 / 1024

# Goroutine count
go_goroutines

# Goroutine trend
delta(go_goroutines[1h])

# CPU usage rate
rate(process_cpu_seconds_total[5m])

# File descriptor usage percentage
process_open_fds / process_max_fds * 100
```

### Common Investigation Queries

```promql
# Slowest endpoints (average latency)
topk(10,
  sum(rate(api_inbound_request_duration_sum[5m])) by (path) /
  sum(rate(api_inbound_request_duration_count[5m])) by (path)
)

# Most requested endpoints
topk(10, sum(rate(api_inbound_request_count[5m])) by (path))

# Endpoints with highest error rate
topk(10,
  sum(rate(api_inbound_request_count{code=~"5.."}[5m])) by (path) /
  sum(rate(api_inbound_request_count[5m])) by (path)
)

# Fraction of requests taking longer than 1 second
1 - (sum(rate(api_inbound_request_duration_bucket{le="1"}[5m])) /
     sum(rate(api_inbound_request_duration_count[5m])))
```

## Prometheus Operator Integration

If you are using the Prometheus Operator, enable the ServiceMonitor in the Helm values:

```yaml
serviceMonitor:
  enabled: true
  interval: 30s
  scrapeTimeout: 10s
  labels:
    release: prometheus  # Match your Prometheus selector
```

See the [Deployment Guide](deployment.md#prometheus-operator-integration) for details.

## Grafana Dashboard

Example dashboard JSON for HyperFleet API monitoring is available in the architecture repository. Key panels to include:

1. **Request Rate** - Total requests per second over time
2. **Error Rate** - Percentage of 5xx responses
3. **Latency Distribution** - P50, P90, P99 latencies
4. **Request Duration Heatmap** - Visual distribution of request times (see the query sketch after this list)
5. **Top Endpoints** - Most frequently accessed paths
6. **Memory Usage** - Resident memory over time
7. **Goroutines** - Goroutine count over time
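
For the heatmap panel (item 4), the per-bucket rate can be plotted directly; a minimal sketch:

```promql
# Per-bucket request completion rate, suitable for a Grafana heatmap panel
sum(rate(api_inbound_request_duration_bucket[5m])) by (le)
```

Formatting the query as a heatmap in Grafana lets the cumulative `le` buckets be de-accumulated correctly for display.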

## Related Documentation

- [Operational Runbook](runbook.md) - Troubleshooting and operational procedures
- [Deployment Guide](deployment.md) - Deployment and ServiceMonitor configuration
- [Development Guide](development.md) - Local development setup