HYPERFLEET-580 - docs: Add operational runbook and metrics documentation #45

Merged
openshift-merge-bot[bot] merged 6 commits into openshift-hyperfleet:main from rafabene:HYPERFLEET-580
Feb 5, 2026

@rafabene (Contributor) commented Feb 4, 2026

Summary

  • Add docs/runbook.md with operational procedures for on-call operations and production troubleshooting
  • Add docs/metrics.md with Prometheus metrics documentation including PromQL examples

Details

docs/runbook.md

  • Health check interpretation guide (/healthz and /readyz)
  • Common operational procedures (restart, scaling, database operations)
  • Troubleshooting guide for frequent issues
  • Recovery steps for known failure modes
  • Escalation paths and severity levels

docs/metrics.md

  • Complete list of exposed Prometheus metrics
  • Description and meaning of each metric
  • Expected ranges and alerting threshold recommendations
  • Example PromQL queries for common investigations

Test plan

  • Verify documentation follows existing docs/ style
  • Verify metrics documented match code in cmd/hyperfleet-api/server/metrics_middleware.go
  • Verify health endpoints documented match code in pkg/health/handler.go
  • Review documentation accuracy

Jira

https://issues.redhat.com/browse/HYPERFLEET-580

Summary by CodeRabbit

  • Documentation
    • Added comprehensive Metrics documentation describing the Prometheus metrics exposed by the API, endpoint and metrics endpoint details, HTTP request metrics with labels and example outputs, expected operating ranges, example PromQL queries for request rate, errors, latency and resource usage, alert thresholds, and guidance for integrating with Prometheus Operator and Grafana dashboards.
    • Added an operational Runbook detailing service overview, health and readiness interpretations, metrics guidance, common operational procedures, troubleshooting guides, recovery workflows, escalation paths, and practical operational commands.

Add comprehensive operational documentation for hyperfleet-api:
- docs/runbook.md: Operational procedures, troubleshooting guide,
  health check interpretation, recovery steps, and escalation paths
- docs/metrics.md: Prometheus metrics reference with descriptions,
  expected ranges, alerting thresholds, and PromQL examples
@openshift-ci openshift-ci bot requested review from 86254860 and crizzo71 February 4, 2026 16:22
coderabbitai bot commented Feb 4, 2026

Walkthrough

Adds two new documentation files: docs/metrics.md, a comprehensive Prometheus metrics reference for the HyperFleet API (metrics endpoint, metric names and labels, path normalization, example outputs, Go runtime and process metrics, expected ranges and alert thresholds, example PromQL queries, and Prometheus/Grafana integration guidance); and docs/runbook.md, an operational runbook describing service endpoints/ports, health probe semantics (liveness/readiness), common operational procedures, database operations, log analysis, troubleshooting workflows, recovery procedures, escalation paths, and kubectl/port-forwarding examples.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Suggested labels

lgtm, approved

🚥 Pre-merge checks | ✅ 3
✅ Passed checks (3 passed)
| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The PR title clearly and concisely summarizes the main change: adding operational runbook and metrics documentation to the docs/ directory. |
| Docstring Coverage | ✅ Passed | No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check. |

@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 5

🤖 Fix all issues with AI agents
In `@docs/metrics.md`:
- Around line 34-39: The fenced code block containing the metric samples (e.g.,
lines starting with api_inbound_request_count{...}) lacks a language identifier
which triggers MD040; add a language tag after the opening backticks (for
example change ``` to ```text) so the block is fenced as ```text and the
markdown linter passes.
- Around line 57-65: The fenced code block in the metrics example lacks a
language tag (MD040); add a language identifier (e.g., "text") after the opening
``` to satisfy the linter. Locate the block containing the Prometheus metrics
like api_inbound_request_duration_bucket, api_inbound_request_duration_sum and
api_inbound_request_duration_count and change the opening fence from ``` to
```text so the snippet is recognized as plain text.
- Around line 24-30: The markdown table under the "Labels:" header violates
MD058; add a single blank line immediately before the table and another blank
line immediately after it so there is an empty line separating the surrounding
text and the table; update the block that lists the Label/Description/Example
Values (the table following the "Labels:" heading) to include those blank lines
to satisfy linting.
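
The MD040 findings above can be checked locally before pushing. The following is a minimal sketch, not the linter's own implementation: the function name `untagged_fences` is hypothetical, and the fence matching is a simplified reading of the Markdown rules (it does not enforce the three-space indentation limit).

```python
import re

# Matches a line that starts (after optional indentation) with a run of
# three or more backticks or tildes, followed by an optional info string.
FENCE = re.compile(r"^(\s*)(`{3,}|~{3,})(.*)$")


def untagged_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences with no language tag."""
    hits = []
    open_fence = None  # marker of the block currently open, if any
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if not match:
            continue
        marker, info = match.group(2), match.group(3).strip()
        if open_fence is None:
            # This line opens a block; flag it when the info string is empty.
            open_fence = marker
            if not info:
                hits.append(lineno)
        elif (
            marker[0] == open_fence[0]
            and len(marker) >= len(open_fence)
            and not info
        ):
            # A closing fence uses the same character, is at least as long
            # as the opener, and carries no info string.
            open_fence = None
    return hits
```

Running it over the contents of docs/metrics.md and docs/runbook.md would list the opening fences that still need a tag such as `text` or `bash`.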

In `@docs/runbook.md`:
- Around line 56-62: The fenced code block containing the kubectl commands (the
``` block with "kubectl delete pod <pod-name> -n hyperfleet-system" and "kubectl
rollout restart deployment/hyperfleet-api -n hyperfleet-system") is missing a
language tag; add the language identifier "bash" to the opening fence (change
``` to ```bash) so the block satisfies MD040 and renders/syntax-highlights as
bash.
- Around line 19-23: The markdown table lacks surrounding blank lines which
triggers MD058; edit the table block in docs/runbook.md (the three-line table
starting with "| Response | Status | Meaning |") and insert a single empty line
immediately before the first table row and a single empty line immediately after
the last table row so the table is separated from surrounding paragraphs.
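
The MD058 rule described above (a blank line before and after every table) can be approximated with a short check. This is a sketch under stated assumptions: the helper name `tables_missing_blank_lines` is hypothetical, only rows that begin with `|` are treated as table rows, and fenced code blocks are not skipped.

```python
def tables_missing_blank_lines(markdown: str) -> list[int]:
    """Return 1-based line numbers of table rows that touch non-blank text."""
    lines = markdown.splitlines()
    # Assumption: a table row is any line whose first non-space character is "|".
    is_row = [line.lstrip().startswith("|") for line in lines]
    hits = []
    for i, row in enumerate(is_row):
        if not row:
            continue
        # The line above is neither a table row nor blank.
        prev_bad = i > 0 and not is_row[i - 1] and lines[i - 1].strip() != ""
        # The line below is neither a table row nor blank.
        next_bad = (
            i + 1 < len(lines)
            and not is_row[i + 1]
            and lines[i + 1].strip() != ""
        )
        if prev_bad or next_bad:
            hits.append(i + 1)
    return hits
```

An empty result means every detected table is separated from the surrounding paragraphs, which is the state the review comment asks for.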

- Add language identifiers to fenced code blocks (MD040)
- Add blank line before Labels table (MD058)
- Replace placeholder escalation contacts with real Slack channel
- Add architecture diagram to Service Overview section
- Fix liveness probe documentation to reflect actual behavior (always returns 200 OK)
- Remove OCM CLI reference from architecture diagram
- Fix db config reference to use correct flag --db-max-open-connections
- Add configurable port info for metrics endpoint
- Add PromQL query example for filtering by instance/pod
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/metrics.md`:
- Around line 242-245: The header/comment "Requests taking longer than 1 second"
is misleading because the Prometheus query using
api_inbound_request_duration_bucket{le="1"} and
api_inbound_request_duration_count computes the fraction of requests completing
within ≤1s; either change the comment to "Requests completing within 1 second
(≤1s)" to match the query, or invert the calculation to match the original
wording by replacing the expression with 1 -
(sum(rate(api_inbound_request_duration_bucket{le="1"}[5m])) /
sum(rate(api_inbound_request_duration_count[5m]))) so it yields requests taking
longer than 1 second.
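
The arithmetic behind this comment can be illustrated outside PromQL. Histogram buckets are cumulative, so the `le="1"` bucket counts requests that finished within one second, and the slow fraction is its complement. The sketch below assumes the two rates have already been read from Prometheus; the function name and the sample numbers are illustrative only.

```python
import math


def fraction_slower_than_bucket(bucket_rate: float, total_rate: float) -> float:
    """Fraction of requests slower than the bucket's upper bound.

    bucket_rate: per-second rate of the cumulative bucket (e.g. le="1").
    total_rate:  per-second rate of the histogram's _count series.
    """
    if total_rate == 0:
        # Mirrors PromQL, where 0/0 evaluates to NaN when there is no traffic.
        return math.nan
    return 1.0 - bucket_rate / total_rate


# Illustrative numbers: 75 of 100 requests per second finish within 1s.
within = 75.0 / 100.0                              # what the original query computed
slower = fraction_slower_than_bucket(75.0, 100.0)  # what the heading described
print(within, slower)  # prints 0.75 0.25
```

The original query returned the `within` value under a heading that described the `slower` value, which is the mismatch the review points out.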

Inverted the calculation to correctly compute percentage of requests
taking longer than 1 second instead of requests completing within 1s.
Added note that rolling updates will not promote new pods until
they become ready.
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In `@docs/runbook.md`:
- Around line 288-289: The runbook currently prints secret contents with
"kubectl get secret hyperfleet-db -n hyperfleet-system -o yaml"; change this to
avoid exposing base64 data by showing only metadata/keys: replace the command
with one that omits values (for example use a go-template to print keys only,
e.g. kubectl get secret hyperfleet-db -n hyperfleet-system -o
go-template='{{range $k,$v := .data}}{{println $k}}{{end}}') so you only reveal
secret key names and metadata instead of the secret data itself.
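
The same idea, keeping key names and dropping values, can be sketched in a script that post-processes the secret's JSON form. This is an illustration rather than the command used in the runbook: it assumes the input is the text produced by `kubectl get secret ... -o json`, and the key names in the usage example are hypothetical.

```python
import json


def secret_key_names(secret_json: str) -> list[str]:
    """Return the sorted key names of a Kubernetes Secret, never its values."""
    secret = json.loads(secret_json)
    # "data" may be absent or null for an empty Secret.
    return sorted(secret.get("data") or {})


# Hypothetical Secret payload; the values are placeholders, not real data.
example = json.dumps(
    {
        "kind": "Secret",
        "metadata": {"name": "hyperfleet-db", "namespace": "hyperfleet-system"},
        "data": {"db.user": "cGxhY2Vob2xkZXI=", "db.password": "cGxhY2Vob2xkZXI="},
    }
)
print(secret_key_names(example))  # prints ['db.password', 'db.user']
```

The go-template form suggested in the comment has the advantage that the encoded values never leave the `kubectl` process, so it remains the safer choice for a runbook.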
🧹 Nitpick comments (1)
docs/runbook.md (1)

106-109: Call out cleanup for background port-forward.

The & backgrounded port-forward can linger and block future use of port 8080. Consider adding a brief note to stop the process (e.g., fg + Ctrl+C, or pkill -f "kubectl port-forward").

Also applies to: 281-283
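
One way to guarantee the cleanup this nitpick asks for is to tie the background process to a scope so that it is always stopped on exit. The sketch below is a generic helper, not part of the runbook; the name `background_process` is hypothetical, and the `kubectl port-forward` arguments in the comment are an assumed invocation based on the deployment and namespace named elsewhere in this PR.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def background_process(cmd: list[str], stop_timeout: float = 5.0):
    """Run cmd in the background and stop it when the with-block exits."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        # Ask the process to stop, then force it if it does not exit in time.
        proc.terminate()
        try:
            proc.wait(timeout=stop_timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


# Assumed usage (not executed here):
# with background_process(
#     ["kubectl", "port-forward", "deployment/hyperfleet-api",
#      "8080:8080", "-n", "hyperfleet-system"]
# ):
#     ...  # query the forwarded port while the with-block is active
```

Because the port-forward ends with the block, port 8080 is released even if the commands inside raise an error.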

Changed kubectl command to only show secret key names instead of
printing the full YAML with base64-encoded values.
@rh-amarin (Contributor) commented

/lgtm

openshift-ci bot commented Feb 5, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rh-amarin

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added the approved label Feb 5, 2026
@openshift-merge-bot openshift-merge-bot bot merged commit c89e058 into openshift-hyperfleet:main Feb 5, 2026
8 checks passed