HYPERFLEET-580 - docs: Add operational runbook and metrics documentation by rafabene · Pull Request #45 · openshift-hyperfleet/hyperfleet-api

rafabene · 2026-02-04T16:22:26Z

Summary

- Add docs/runbook.md with operational procedures for on-call operations and production troubleshooting
- Add docs/metrics.md with Prometheus metrics documentation including PromQL examples

Details

docs/runbook.md

- Health check interpretation guide (/healthz and /readyz)
- Common operational procedures (restart, scaling, database operations)
- Troubleshooting guide for frequent issues
- Recovery steps for known failure modes
- Escalation paths and severity levels

docs/metrics.md

- Complete list of exposed Prometheus metrics
- Description and meaning of each metric
- Expected ranges and alerting threshold recommendations
- Example PromQL queries for common investigations

Test plan

- Verify documentation follows existing docs/ style
- Verify metrics documented match code in cmd/hyperfleet-api/server/metrics_middleware.go
- Verify health endpoints documented match code in pkg/health/handler.go
- Review documentation accuracy

Jira

https://issues.redhat.com/browse/HYPERFLEET-580

Summary by CodeRabbit

Documentation
- Added comprehensive Metrics documentation describing the Prometheus metrics exposed by the API, endpoint and metrics endpoint details, HTTP request metrics with labels and example outputs, expected operating ranges, example PromQL queries for request rate, errors, latency and resource usage, alert thresholds, and guidance for integrating with Prometheus Operator and Grafana dashboards.
- Added an operational Runbook detailing service overview, health and readiness interpretations, metrics guidance, common operational procedures, troubleshooting guides, recovery workflows, escalation paths, and practical operational commands.

Add comprehensive operational documentation for hyperfleet-api:
- docs/runbook.md: Operational procedures, troubleshooting guide, health check interpretation, recovery steps, and escalation paths
- docs/metrics.md: Prometheus metrics reference with descriptions, expected ranges, alerting thresholds, and PromQL examples

coderabbitai · 2026-02-04T16:22:51Z

Walkthrough

Adds two new documentation files: docs/metrics.md, a comprehensive Prometheus metrics reference for the HyperFleet API (metrics endpoint, metric names and labels, path normalization, example outputs, Go runtime and process metrics, expected ranges and alert thresholds, example PromQL queries, and Prometheus/Grafana integration guidance); and docs/runbook.md, an operational runbook describing service endpoints/ports, health probe semantics (liveness/readiness), common operational procedures, database operations, log analysis, troubleshooting workflows, recovery procedures, escalation paths, and kubectl/port-forwarding examples.

Estimated code review effort

🎯 2 (Simple) | ⏱️ ~10 minutes

Possibly related PRs

docs: consolidate and streamline documentation structure #21: Related documentation consolidation and expansion; appears to complement earlier docs changes in PR #21.

Suggested labels

lgtm, approved

🚥 Pre-merge checks | ✅ 3

✅ Passed checks (3 passed)

Check name	Status	Explanation
Description Check	✅ Passed	Check skipped - CodeRabbit’s high-level summary is enabled.
Title check	✅ Passed	The PR title clearly and concisely summarizes the main change: adding operational runbook and metrics documentation to the docs/ directory.
Docstring Coverage	✅ Passed	No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.

coderabbitai

Actionable comments posted: 5

🤖 Fix all issues with AI agents

In `@docs/metrics.md`:
- Around line 34-39: The fenced code block containing the metric samples (e.g.,
lines starting with api_inbound_request_count{...}) lacks a language identifier
which triggers MD040; add a language tag after the opening backticks (for
example change ``` to ```text) so the block is fenced as ```text and the
markdown linter passes.
- Around line 57-65: The fenced code block in the metrics example lacks a
language tag (MD040); add a language identifier (e.g., "text") after the opening
``` to satisfy the linter. Locate the block containing the Prometheus metrics
like api_inbound_request_duration_bucket, api_inbound_request_duration_sum and
api_inbound_request_duration_count and change the opening fence from ``` to
```text so the snippet is recognized as plain text.
- Around line 24-30: The markdown table under the "Labels:" header violates
MD058; add a single blank line immediately before the table and another blank
line immediately after it so there is an empty line separating the surrounding
text and the table; update the block that lists the Label/Description/Example
Values (the table following the "Labels:" heading) to include those blank lines
to satisfy linting.

In `@docs/runbook.md`:
- Around line 56-62: The fenced code block containing the kubectl commands (the
``` block with "kubectl delete pod <pod-name> -n hyperfleet-system" and "kubectl
rollout restart deployment/hyperfleet-api -n hyperfleet-system") is missing a
language tag; add the language identifier "bash" to the opening fence (change
``` to ```bash) so the block satisfies MD040 and renders/syntax-highlights as
bash.
- Around line 19-23: The markdown table lacks surrounding blank lines which
triggers MD058; edit the table block in docs/runbook.md (the three-line table
starting with "| Response | Status | Meaning |") and insert a single empty line
immediately before the first table row and a single empty line immediately after
the last table row so the table is separated from surrounding paragraphs.
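The MD040 findings above could be caught before review with a small local check. The following is a minimal sketch, not the linter's actual implementation: the function name `fences_missing_language` is hypothetical, and it only approximates CommonMark fence matching (backtick or tilde fences with up to three spaces of indentation).

```python
import re

# Opening or closing fence: up to 3 spaces of indent, then 3+ backticks or tildes.
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def fences_missing_language(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that have no language tag."""
    missing = []
    open_fence = None  # (fence character, fence length) while inside a block
    for number, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if not match:
            continue
        marker, info = match.group(1), match.group(2).strip()
        if open_fence is None:
            # This line opens a block; an empty info string is what MD040 reports.
            open_fence = (marker[0], len(marker))
            if not info:
                missing.append(number)
        elif marker[0] == open_fence[0] and len(marker) >= open_fence[1] and not info:
            # Same fence character, at least as long, no info string: closes the block.
            open_fence = None
    return missing


fence = "`" * 3  # built at runtime so this example contains no literal fence
sample = "\n".join(
    ["Intro", "", fence, "kubectl get pods", fence, "", fence + "bash", "kubectl get pods", fence]
)
print(fences_missing_language(sample))  # [3]
```

Only the untagged block on line 3 is reported; the block tagged `bash` passes. A real markdown linter remains the authoritative check.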

- Add language identifiers to fenced code blocks (MD040)
- Add blank line before Labels table (MD058)
- Replace placeholder escalation contacts with real Slack channel
- Add architecture diagram to Service Overview section

- Fix liveness probe documentation to reflect actual behavior (always returns 200 OK)
- Remove OCM CLI reference from architecture diagram
- Fix db config reference to use correct flag --db-max-open-connections
- Add configurable port info for metrics endpoint
- Add PromQL query example for filtering by instance/pod

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@docs/metrics.md`:
- Around line 242-245: The header/comment "Requests taking longer than 1 second"
is misleading because the Prometheus query using
api_inbound_request_duration_bucket{le="1"} and
api_inbound_request_duration_count computes the fraction of requests completing
within ≤1s; either change the comment to "Requests completing within 1 second
(≤1s)" to match the query, or invert the calculation to match the original
wording by replacing the expression with 1 -
(sum(rate(api_inbound_request_duration_bucket{le="1"}[5m])) /
sum(rate(api_inbound_request_duration_count[5m]))) so it yields requests taking
longer than 1 second.
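The difference between the two readings is plain arithmetic on the histogram's cumulative bucket. A small worked sketch follows; the rates are made-up illustrative numbers, not measurements from the service, and `fraction_slower_than` is a hypothetical helper name.

```python
def fraction_slower_than(bucket_le_rate: float, total_rate: float) -> float:
    """Fraction of requests slower than the bucket bound: 1 - (le-bucket rate / total rate)."""
    if total_rate <= 0:
        raise ValueError("total_rate must be positive")
    return 1.0 - bucket_le_rate / total_rate


# Hypothetical per-second rates over a 5m window (illustrative numbers only).
within_1s = 47.5  # stands in for sum(rate(api_inbound_request_duration_bucket{le="1"}[5m]))
total = 50.0      # stands in for sum(rate(api_inbound_request_duration_count[5m]))

fast = within_1s / total                       # what the uninverted query yields
slow = fraction_slower_than(within_1s, total)  # what "longer than 1 second" needs
print(f"<=1s: {fast:.1%}, >1s: {slow:.1%}")    # <=1s: 95.0%, >1s: 5.0%
```

Because Prometheus histogram buckets are cumulative, the `le="1"` bucket counts every request that finished within one second, so subtracting its share from 1 gives the slow share.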

Inverted the calculation to correctly compute percentage of requests taking longer than 1 second instead of requests completing within 1s.

Added note that rolling updates will not promote new pods until they become ready.

coderabbitai

Actionable comments posted: 1

🤖 Fix all issues with AI agents

In `@docs/runbook.md`:
- Around line 288-289: The runbook currently prints secret contents with
"kubectl get secret hyperfleet-db -n hyperfleet-system -o yaml"; change this to
avoid exposing base64 data by showing only metadata/keys: replace the command
with one that omits values (for example use a go-template to print keys only,
e.g. kubectl get secret hyperfleet-db -n hyperfleet-system -o
go-template='{{range $k,$v := .data}}{{println $k}}{{end}}') so you only reveal
secret key names and metadata instead of the secret data itself.
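The same keys-only idea can be sketched in Python for cases where the Secret has already been fetched as JSON by a script. This is an illustration under assumptions: the key names `db.host` and `db.password` and the helper `secret_key_names` are hypothetical, and the values shown are dummy data.

```python
import json


def secret_key_names(secret_json: str) -> list[str]:
    """Return the sorted key names under .data, never the values."""
    secret = json.loads(secret_json)
    # .data may be absent or null on an empty Secret.
    return sorted((secret.get("data") or {}).keys())


# Dummy Secret document; key names and values are invented for this example.
example = json.dumps(
    {
        "kind": "Secret",
        "metadata": {"name": "hyperfleet-db", "namespace": "hyperfleet-system"},
        "data": {"db.password": "ZHVtbXk=", "db.host": "ZHVtbXk="},
    }
)
print(secret_key_names(example))  # ['db.host', 'db.password']
```

Note that this variant still loads the encoded values into the local process, so the go-template form above is preferable when the goal is to keep them out of the terminal and shell history entirely.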

🧹 Nitpick comments (1)

docs/runbook.md (1)

106-109: Call out cleanup for background port-forward.

The & backgrounded port-forward can linger and block future use of port 8080. Consider adding a brief note to stop the process (e.g., fg + Ctrl+C, or pkill -f "kubectl port-forward").

Also applies to: 281-283
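One way to avoid a lingering port-forward is to tie its lifetime to the code that needs it. The sketch below uses a context manager so the child process is always stopped on exit; `background_process` is a hypothetical helper, and the kubectl arguments in the comment (deployment target and 8080:8080 mapping) are assumptions based on the names mentioned in this review, not commands taken from the runbook.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def background_process(command: list[str]):
    """Run a command in the background and always stop it when the block exits."""
    proc = subprocess.Popen(command)
    try:
        yield proc
    finally:
        proc.terminate()  # polite stop first
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()  # force stop if it ignores the request
            proc.wait()


# Usage sketch (requires kubectl and cluster access, so it is left commented out;
# the target and port mapping are assumed values):
#
# with background_process(
#     ["kubectl", "port-forward", "deployment/hyperfleet-api", "8080:8080",
#      "-n", "hyperfleet-system"]
# ):
#     ...  # query the forwarded port here; port 8080 is released on exit
```

Compared with a shell `&`, this leaves nothing to clean up by hand, at the cost of running the check from a script instead of an interactive terminal.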

Changed kubectl command to only show secret key names instead of printing the full YAML with base64-encoded values.

rh-amarin · 2026-02-05T17:59:14Z

/lgtm

openshift-ci · 2026-02-05T17:59:21Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: rh-amarin

Details

Needs approval from an approver in each of these files:

~~OWNERS~~ [rh-amarin]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

openshift-ci bot requested review from 86254860 and crizzo71 February 4, 2026 16:22

coderabbitai bot reviewed Feb 4, 2026

HYPERFLEET-580 - fix: Address review feedback

6c60e83

ciaranRoche reviewed Feb 5, 2026

coderabbitai bot reviewed Feb 5, 2026

rh-amarin reviewed Feb 5, 2026

rh-amarin reviewed Feb 5, 2026

rafabene added 2 commits February 5, 2026 14:34

HYPERFLEET-580 - fix: Correct PromQL query for requests > 1 second

3294624

HYPERFLEET-580 - fix: Add rolling update behavior to readiness docs

fe88abf

coderabbitai bot reviewed Feb 5, 2026

HYPERFLEET-580 - fix: Avoid exposing secret values in runbook

7e4d45d

openshift-ci bot assigned rh-amarin Feb 5, 2026

openshift-ci bot added the lgtm label Feb 5, 2026

openshift-ci bot added the approved label Feb 5, 2026

openshift-merge-bot bot merged commit c89e058 into openshift-hyperfleet:main Feb 5, 2026
8 checks passed
