Monitor vm faults and alert on no valid host found #667
Conversation
No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration: Path: .coderabbit.yaml · Review profile: CHILL · Plan: Pro · Run ID:
📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 Walkthrough

Added a VM faults KPI and Prometheus alert; extended the Nova Server model with fault fields and flavor hypervisor parsing; switched the OpenStack servers table to `openstack_servers_v2`.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant Prom as Prometheus/Scraper
    participant KPI as VMFaultsKPI
    participant DB as Database
    participant Nova as Nova Models
    Prom->>KPI: Collect()
    KPI->>DB: Query servers from openstack_servers_v2
    DB-->>KPI: Server rows (id, az, fault*, flavor)
    KPI->>DB: Query flavors from openstack_flavors_v2
    DB-->>KPI: Flavor rows (name, ExtraSpecs)
    loop per server
        KPI->>Nova: GetHypervisorType(flavor.ExtraSpecs)
        Nova-->>KPI: FlavorHypervisorType
        KPI->>KPI: Build labels (az, hvtype, state, faultcode, faultmsg, faultyvm)
        KPI->>KPI: Aggregate counts by label set
    end
    KPI->>Prom: Emit cortex_vm_faults gauges
```
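The "aggregate counts by label set" step in the sequence diagram can be sketched as follows. This is a hypothetical illustration only: the struct fields and label names mirror the diagram, but the real `VMFaultsKPI` implementation may differ.

```go
// Hypothetical sketch of aggregating server rows into cortex_vm_faults
// label sets; not the actual VMFaultsKPI code.
package main

import "fmt"

// serverRow is an assumed shape for a row from openstack_servers_v2.
type serverRow struct {
	AZ, HVType, State, FaultCode, FaultMsg string
}

// labelSet is one cortex_vm_faults label combination.
type labelSet struct {
	AZ, HVType, State, FaultCode, FaultMsg, FaultyVM string
}

// aggregate counts servers per label set; a server is flagged faulty
// when it carries a fault message.
func aggregate(rows []serverRow) map[labelSet]int {
	counts := map[labelSet]int{}
	for _, r := range rows {
		faulty := "no"
		if r.FaultMsg != "" {
			// The real KPI may put the VM name here; "no" marks non-faulty
			// VMs, matching the faultyvm!="no" selector seen below.
			faulty = "yes"
		}
		counts[labelSet{r.AZ, r.HVType, r.State, r.FaultCode, r.FaultMsg, faulty}]++
	}
	return counts
}

func main() {
	rows := []serverRow{
		{"az-1", "QEMU", "ERROR", "500", "No valid host was found."},
		{"az-1", "QEMU", "ERROR", "500", "No valid host was found."},
		{"az-2", "QEMU", "ACTIVE", "0", ""},
	}
	// Two distinct label sets: one faulty pair, one healthy server.
	fmt.Println(len(aggregate(rows)))
}
```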
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 2 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@helm/bundles/cortex-nova/alerts/nova.alerts.yaml`:
- Around line 596-603: The alert CortexNovaDoesntFindValidKVMHosts is using the
wrong label name and is not aggregating per-AZ/hypervisor pair; change the
selector on metric cortex_vm_faults to use faultmsg (not faultmessage) and
aggregate away the per-VM dimension (faultyvm) so one signal is produced per
AZ/hypervisor pair — e.g. wrap the filtered series in a sum by (az, hvtype) (or
the AZ and hypervisor labels used in cortex_vm_faults) and then compare that
result to > 0; update the expr in the CortexNovaDoesntFindValidKVMHosts alert
accordingly.
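Applied to the alert rule, the suggestion might look like the sketch below. This is an assumption-laden illustration: the label names `az` and `hvtype` are taken from the review comment and the sequence diagram, not verified against the repository.

```yaml
# Sketch only: use the faultmsg label and aggregate away the per-VM
# dimension so one signal fires per AZ/hypervisor pair.
- alert: CortexNovaDoesntFindValidKVMHosts
  expr: |
    sum by (az, hvtype) (
      cortex_vm_faults{faultmsg=~".*No valid host was found.*"}
    ) > 0
```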
In `@internal/knowledge/datasources/plugins/openstack/nova/nova_types.go`:
- Around line 359-363: GetHypervisorType currently calls json.Unmarshal on
f.ExtraSpecs which will error on an empty string; update the function
(Flavor.GetHypervisorType) to handle the empty ExtraSpecs case by checking if
f.ExtraSpecs == "" (or len(trimspace(f.ExtraSpecs)) == 0) and initializing
extraSpecs = map[string]string{} (or returning a zero-value FlavorHypervisorType
and nil error if that makes sense), otherwise unmarshal as before; keep the rest
of the logic that inspects extraSpecs intact so you avoid spurious json
unmarshal errors for empty ExtraSpecs.
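A minimal sketch of the suggested fix is shown below. The real `Flavor` type and `FlavorHypervisorType` live in `nova_types.go` and may differ; the extra-spec key used here is an assumption for illustration.

```go
// Hypothetical sketch of GetHypervisorType with the empty-ExtraSpecs
// guard the review asks for; not the actual nova_types.go code.
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

type Flavor struct {
	// ExtraSpecs is assumed to be a JSON-encoded map, e.g.
	// `{"capabilities:hypervisor_type": "QEMU"}`.
	ExtraSpecs string
}

// GetHypervisorType returns the hypervisor type from the flavor's extra
// specs. An empty ExtraSpecs string is treated as "no specs" instead of
// being fed to json.Unmarshal, which would return a spurious error.
func (f Flavor) GetHypervisorType() (string, error) {
	extraSpecs := map[string]string{}
	if strings.TrimSpace(f.ExtraSpecs) != "" {
		if err := json.Unmarshal([]byte(f.ExtraSpecs), &extraSpecs); err != nil {
			return "", err
		}
	}
	// Key name is an assumption; the real code may inspect other keys.
	return extraSpecs["capabilities:hypervisor_type"], nil
}

func main() {
	for _, f := range []Flavor{
		{ExtraSpecs: ""},
		{ExtraSpecs: `{"capabilities:hypervisor_type": "QEMU"}`},
	} {
		hv, err := f.GetHypervisorType()
		fmt.Printf("%q %v\n", hv, err)
	}
}
```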
In `@tools/plutono/provisioning/dashboards/cortex-status.json`:
- Around line 607-615: The panel's PromQL expression (the "expr" field currently
using cortex_vm_faults{faultcode!=\"0\"}) is undercounting message-only faults;
update the expression to filter on the VMFaultsKPI indicator instead by
replacing the selector cortex_vm_faults{faultcode!=\"0\"} with
cortex_vm_faults{faultyvm!=\"no\"} so the panel counts VMs flagged as faulty
when faultmsg is present; keep the rest of the panel (legendFormat, refId "A",
etc.) unchanged.
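After the suggested change, the changed fields of the panel target might look like the fragment below (other panel fields such as `legendFormat` are left as they are in the dashboard and omitted here):

```json
{
  "expr": "cortex_vm_faults{faultyvm!=\"no\"}",
  "refId": "A"
}
```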
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 88822616-d091-45bf-bb54-1b91d9132d6b
📒 Files selected for processing (15)
helm/bundles/cortex-nova/alerts/nova.alerts.yaml
helm/bundles/cortex-nova/templates/kpis.yaml
internal/knowledge/datasources/plugins/openstack/nova/nova_types.go
internal/knowledge/extractor/plugins/compute/libvirt_domain_cpu_steal_pct.sql
internal/knowledge/extractor/plugins/compute/libvirt_domain_cpu_steal_pct_test.go
internal/knowledge/extractor/plugins/compute/vm_host_residency.sql
internal/knowledge/extractor/plugins/compute/vm_life_span.sql
internal/knowledge/extractor/plugins/compute/vrops_hostsystem_resolver.sql
internal/knowledge/extractor/plugins/compute/vrops_project_noisiness.sql
internal/knowledge/kpis/plugins/compute/vm_faults.go
internal/knowledge/kpis/plugins/compute/vm_faults_test.go
internal/knowledge/kpis/supported_kpis.go
internal/scheduling/reservations/commitments/controller.go
tools/plutono/provisioning/dashboards/cortex-status.json
tools/visualize-reservations/main.go
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@helm/bundles/cortex-nova/alerts/nova.alerts.yaml`:
- Around line 596-598: The CortexNovaDoesntFindValidKVMHosts alert uses a regex
that requires characters before/after the phrase and lacks a silence window;
update the expr in the alert (CortexNovaDoesntFindValidKVMHosts) to use .*No
valid host was found.* instead of .+No valid host was found.+ and add a for: 5m
clause (matching other alerts) so the rule only fires after the condition
persists for 5 minutes.
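Applying both suggestions, the rule might read as sketched below; the surrounding alert definition is assumed from the review comments, not copied from the repository.

```yaml
# Sketch only: relaxed regex anchors (.* instead of .+) and a 5-minute
# hold so the alert only fires once the condition persists.
- alert: CortexNovaDoesntFindValidKVMHosts
  expr: cortex_vm_faults{faultmsg=~".*No valid host was found.*"} > 0
  for: 5m
```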
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 6a63d4ee-0c15-48df-bd00-8b0c105d7f73
📒 Files selected for processing (3)
helm/bundles/cortex-nova/alerts/nova.alerts.yaml
internal/knowledge/datasources/plugins/openstack/nova/nova_types.go
tools/plutono/provisioning/dashboards/cortex-status.json
Test Coverage Report 📊: 68.4%
Errors during Nova scheduling such as `No valid host found` may indicate that the datacenter is running out of capacity, or that Cortex filtered out too many hosts from the request. In this case we should be warned so we can take a look at the erroring virtual machines.

This change downloads server fault codes and error messages from OpenStack into a new table `openstack_servers_v2`, from where a new KPI `vm-faults` picks them up and exposes erroring VMs (including their error message) as a Prometheus metric. To get notified, we add an alert on a very specific error message that indicates the highlighted scenario. In addition, the metric also lets us see which other errors are currently occurring during scheduling. Finally, we add a panel to our Plutono dashboard that shows the development over time.
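Once the metric is scraped, queries along these lines could be used to explore it ad hoc; the label names follow the KPI description above and are not verified against the final implementation.

```promql
# Total VMs currently flagged as faulty (faultyvm!="no" per the KPI).
count(cortex_vm_faults{faultyvm!="no"})

# Fault counts grouped by error message, to see what else is failing.
sum by (faultmsg) (cortex_vm_faults{faultyvm!="no"})
```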