HYPERFLEET-580 - docs: Add operational runbook and metrics documentation #45
Add comprehensive operational documentation for hyperfleet-api:
- docs/runbook.md: Operational procedures, troubleshooting guide, health check interpretation, recovery steps, and escalation paths
- docs/metrics.md: Prometheus metrics reference with descriptions, expected ranges, alerting thresholds, and PromQL examples
Walkthrough
Adds two new documentation files.
Estimated code review effort: 🎯 2 (Simple) | ⏱️ ~10 minutes
Actionable comments posted: 5
🤖 Fix all issues with AI agents
In `@docs/metrics.md`:
- Around line 34-39: The fenced code block containing the metric samples (e.g.,
lines starting with api_inbound_request_count{...}) lacks a language identifier,
which triggers MD040; add a language tag after the opening backticks (for
example, change the opening fence from `` ``` `` to `` ```text ``) so the
markdown linter passes.
- Around line 57-65: The fenced code block in the metrics example lacks a
language tag (MD040); add a language identifier (e.g., "text") after the opening
fence to satisfy the linter. Locate the block containing the Prometheus metrics
like api_inbound_request_duration_bucket, api_inbound_request_duration_sum and
api_inbound_request_duration_count and change the opening fence from `` ``` ``
to `` ```text `` so the snippet is recognized as plain text.
- Around line 24-30: The markdown table under the "Labels:" header violates
MD058; add a single blank line immediately before the table and another blank
line immediately after it so there is an empty line separating the surrounding
text and the table; update the block that lists the Label/Description/Example
Values (the table following the "Labels:" heading) to include those blank lines
to satisfy linting.
In `@docs/runbook.md`:
- Around line 56-62: The fenced code block containing the kubectl commands (the
block with "kubectl delete pod <pod-name> -n hyperfleet-system" and "kubectl
rollout restart deployment/hyperfleet-api -n hyperfleet-system") is missing a
language tag; add the language identifier "bash" to the opening fence (change
`` ``` `` to `` ```bash ``) so the block satisfies MD040 and is
syntax-highlighted as bash.
- Around line 19-23: The markdown table lacks surrounding blank lines which
triggers MD058; edit the table block in docs/runbook.md (the three-line table
starting with "| Response | Status | Meaning |") and insert a single empty line
immediately before the first table row and a single empty line immediately after
the last table row so the table is separated from surrounding paragraphs.
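The MD040 findings above can also be checked locally before pushing. The following is a minimal sketch, not the linter's actual implementation: the function name `untagged_fences` is hypothetical, and the fence detection is a simplified reading of the fenced-code-block rules (it ignores fences nested inside list items or block quotes).

```python
import re

# Simplified fence pattern: up to 3 spaces of indent, then 3+ backticks or tildes,
# then an optional info string (the language tag).
FENCE = re.compile(r"^ {0,3}(`{3,}|~{3,})(.*)$")


def untagged_fences(markdown: str) -> list[int]:
    """Return 1-based line numbers of opening fences that have no language tag."""
    missing = []
    open_marker = None  # marker of the block we are currently inside, if any
    for lineno, line in enumerate(markdown.splitlines(), start=1):
        match = FENCE.match(line)
        if not match:
            continue
        marker, info = match.group(1), match.group(2).strip()
        if open_marker is None:
            # Opening fence: record it if the info string is empty.
            open_marker = marker
            if not info:
                missing.append(lineno)
        elif marker[0] == open_marker[0] and len(marker) >= len(open_marker) and not info:
            # Closing fence: same character, at least as long, no info string.
            open_marker = None
    return missing
```

Running it over docs/metrics.md and docs/runbook.md would list the opening fences that still need a tag such as `text` or `bash`.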
- Add language identifiers to fenced code blocks (MD040)
- Add blank line before Labels table (MD058)
- Replace placeholder escalation contacts with real Slack channel
- Add architecture diagram to Service Overview section

- Fix liveness probe documentation to reflect actual behavior (always returns 200 OK)
- Remove OCM CLI reference from architecture diagram
- Fix db config reference to use correct flag --db-max-open-connections
- Add configurable port info for metrics endpoint
- Add PromQL query example for filtering by instance/pod
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/metrics.md`:
- Around line 242-245: The header/comment "Requests taking longer than 1 second"
is misleading because the Prometheus query using
api_inbound_request_duration_bucket{le="1"} and
api_inbound_request_duration_count computes the fraction of requests completing
within ≤1s; either change the comment to "Requests completing within 1 second
(≤1s)" to match the query, or invert the calculation to match the original
wording by replacing the expression with 1 -
(sum(rate(api_inbound_request_duration_bucket{le="1"}[5m])) /
sum(rate(api_inbound_request_duration_count[5m]))) so it yields requests taking
longer than 1 second.
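The arithmetic behind the two readings of the query can be sketched as follows. This is an illustration only: the function name and the sample rates are hypothetical, and the inputs stand in for the two `sum(rate(...[5m]))` terms in the expression above. Prometheus histogram buckets are cumulative, so the `le="1"` bucket counts every request that completed in 1 second or less.

```python
def fraction_slower_than_threshold(bucket_rate: float, count_rate: float) -> float:
    """Fraction of requests that took longer than the bucket's upper bound.

    bucket_rate: per-second rate of requests in the cumulative le=<threshold> bucket
    count_rate:  per-second rate of all requests (the histogram's _count series)
    """
    if count_rate <= 0:
        raise ValueError("no requests observed in the window")
    # bucket_rate / count_rate is the share completing within the threshold;
    # subtracting it from 1 gives the share that exceeded it.
    return 1.0 - bucket_rate / count_rate
```

With hypothetical rates of 95 requests/s in the `le="1"` bucket and 100 requests/s overall, the uninverted ratio is 0.95 (completed within 1s), while the inverted form yields roughly 0.05 (took longer than 1s), which is what the original comment wording describes.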
Inverted the calculation to correctly compute percentage of requests taking longer than 1 second instead of requests completing within 1s.
Added note that rolling updates will not promote new pods until they become ready.
Actionable comments posted: 1
🤖 Fix all issues with AI agents
In `@docs/runbook.md`:
- Around line 288-289: The runbook currently prints secret contents with
"kubectl get secret hyperfleet-db -n hyperfleet-system -o yaml"; change this to
avoid exposing base64 data by showing only metadata/keys: replace the command
with one that omits values (for example use a go-template to print keys only,
e.g. kubectl get secret hyperfleet-db -n hyperfleet-system -o
go-template='{{range $k,$v := .data}}{{println $k}}{{end}}') so you only reveal
secret key names and metadata instead of the secret data itself.
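The same keys-only idea can be expressed outside of a go-template. The sketch below assumes the Secret manifest is available as a JSON string (for example from `kubectl get secret ... -o json`, piped directly rather than printed); the function name and the key names used in the usage note are hypothetical, not taken from the runbook.

```python
import json


def secret_key_names(secret_json: str) -> list[str]:
    """Return the sorted key names under `.data` of a Secret manifest.

    Only the keys are returned; the base64-encoded values are never
    included in the result.
    """
    manifest = json.loads(secret_json)
    # `.data` may be absent or null on an empty Secret.
    return sorted((manifest.get("data") or {}).keys())
```

For a manifest whose `.data` holds hypothetical keys `db.host` and `db.password`, the function returns just those two names, mirroring what the go-template in the comment above prints.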
🧹 Nitpick comments (1)
docs/runbook.md (1)
106-109: Call out cleanup for background port-forward. The `&` backgrounded port-forward can linger and block future use of port 8080. Consider adding a brief note to stop the process (e.g., `fg` followed by Ctrl+C, or `pkill -f "kubectl port-forward"`). Also applies to: 281-283
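One way to make the cleanup automatic when the port-forward is driven from a script is to tie the process lifetime to a scope. This is a sketch under assumptions, not part of the runbook: the helper name `background_process` is hypothetical, and the kubectl arguments in the comment reuse the deployment and namespace quoted earlier in this review with an assumed 8080:8080 port mapping.

```python
import subprocess
from contextlib import contextmanager


@contextmanager
def background_process(cmd):
    """Run `cmd` in the background and make sure it is stopped on exit."""
    proc = subprocess.Popen(cmd)
    try:
        yield proc
    finally:
        # Ask the process to stop; escalate to kill if it does not exit in time.
        proc.terminate()
        try:
            proc.wait(timeout=5)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()


# Hypothetical usage (port mapping is an assumption):
# with background_process(["kubectl", "port-forward", "deployment/hyperfleet-api",
#                          "8080:8080", "-n", "hyperfleet-system"]):
#     ...  # query the forwarded port here; the forward is stopped afterwards
```

Because the process is terminated when the `with` block exits, port 8080 is released even if the commands inside the block fail.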
Changed kubectl command to only show secret key names instead of printing the full YAML with base64-encoded values.
/lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: rh-amarin
c89e058 into openshift-hyperfleet:main
Summary
- docs/runbook.md with operational procedures for on-call operations and production troubleshooting
- docs/metrics.md with Prometheus metrics documentation including PromQL examples
Details
docs/runbook.md
/healthz and /readyz)
docs/metrics.md
Test plan
- docs/style
- cmd/hyperfleet-api/server/metrics_middleware.go
- pkg/health/handler.go
Jira
https://issues.redhat.com/browse/HYPERFLEET-580