feat(chart): add kubelet-probe ingress to NetworkPolicy (M5b #1)#476
Conversation
Kubelet liveness/readiness probes originate from the node IP (host
network), which is NOT selectable via namespaceSelector / podSelector.
Without a port-scoped ipBlock rule, every pod flips NotReady within
one failureThreshold window when networkPolicy.enabled is true — the
chart would render its own DaemonSet inoperable.
Adds networkPolicy.kubeletProbes.{enabled,cidr,except} gated on the
existing networkPolicy.enabled toggle. Default cidr=0.0.0.0/0 stays
narrow at L4 because the rule is scoped to the health port only;
operators with a fixed node CIDR tighten it in their overlay.
Production preset (values-production.yaml) inherits enabled: true.
Cross-linked from docs/threat-model.md §6.G (network-surface audit)
and chart README §security.
Closes M5b chart opportunistic #1.
Signed-off-by: Tri Lam <tree@lumalabs.ai>
Code Review: PR #476 — Independent Adversarial ReviewB/A/A+ Grade: BLOCK (pending simplification)Finding: Comment Duplication in networkpolicy.yamlFile: The template contains two overlapping descriptions of the kubelet-probe ingress rule:
The header documents why this feature exists; the inline should document what this rule does in 3–4 lines only. Suggested replacement: # kubelet probes originate from node IP (only ipBlock matches).
# L4-scoped to health port so 0.0.0.0/0 default stays narrow;
# operators with fixed node CIDR tighten via kubeletProbes.cidr.Security & Technical Verification ✓ipBlock for host-network probes: Correct. Kubelet probes originate from node IP (host namespace), not pod namespace. K8s NetworkPolicy only matches them via ipBlock (podSelector/namespaceSelector resolve Kubernetes objects; the kubelet probe is a raw IP with no object). Per-spec. 0.0.0.0/0 CIDR: Justified. L4-scoped to health port (:13133), carries readiness check only, default ON prevents DaemonSet NotReady spiral, comprehensive documentation (values.yaml §kubeletProbes, README.md cross-linked to threat-model.md §6.G). Schema Threat-model cross-link: Section 6.G covers NetworkPolicy audit scope. Test coverage: helm lint, template, conftest (56 tests), kubeconform all passed. CI green. Next: Trim inline comment and force-push. All other aspects are A+-grade. |
Signed-off-by: Tri Lam <tree@lumalabs.ai>
|
BLOCK trim applied: inline comment 13→4 lines (port scope + ipBlock rationale only); header carries full context. helm lint + template both clean. Re-requesting review. |
Re-Review: PR #476 commit 08815c4 (BLOCK → A+)Per-Fix Verification1. Comment reduction: 13→4 lines ✓
2. ipBlock rationale + :13133 port scope ✓
3. No new noise ✓
4. helm lint + helm template ✓ 5. Test coverage maintained ✓
Simplification AssessmentTrim is load-bearing: resolves comment duplication (review doctrine: document once, link from detail). The 4-line replacement provides exactly what an operator needs at the rule site (port scope + ipBlock rationale) without duplicating broader architectural narrative in header + external docs. VERDICT: A+ — Ready to merge. BLOCK successfully resolved; no new findings. |
…) (#496) ## Summary `policy-matrix.yml` workflow has been failing on every chart-touching PR (blocked #476, #481, #493) since #475 landed. The chart's production preset (`values-production.yaml`) flips `serviceMonitor.enabled=true`, which renders a `monitoring.coreos.com/v1 ServiceMonitor` resource. Kind clusters don't ship the prometheus-operator CRDs, so `helm install --dry-run=server` exits 1 with: ``` no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1" ensure CRDs are installed first ``` ## Root cause Kind ships only the core Kubernetes API set. `monitoring.coreos.com/v1` is supplied by prometheus-operator, which the policy-matrix kind cluster never installs. The chart's `templates/servicemonitor.yaml` is gated by `.Values.serviceMonitor.enabled` — default `false` (chart stays first-install-compatible on bare clusters), but the production preset enables it (kube-prometheus-stack convention). The policy-matrix gate exercises both default and production presets across PSA / Kyverno / Gatekeeper, so the production rows hit the missing CRD on every run. ## Fix Issue #494 recommended option (a) — install the missing CRD prereq. This PR adds a single workflow step after kind cluster spin-up but before the smoke script: ```yaml - name: Install prometheus-operator ServiceMonitor CRD (issue #494) run: | kubectl apply -f \ "https://github.com/prometheus-operator/prometheus-operator/v0.91.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml" kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s ``` Design choices: - **Slim CRD (ServiceMonitor only)** vs full prometheus-operator bundle. The chart's production preset references no other `monitoring.coreos.com` kinds. Slim install (~700 lines of YAML) avoids pulling Prometheus, Alertmanager, ThanosRuler, PodMonitor, Probe, PrometheusRule we don't exercise. - **Applied to every matrix row** (not just production). A future flip of the default `serviceMonitor.enabled` toggle cannot silently re-break this gate. - **Pinned to `v0.91.0`** (latest stable, published 2026-05-05). Matches the existing `KYVERNO_POLICIES_REF` / `GATEKEEPER_VERSION` pin convention in `scripts/policy-matrix-smoke.sh`. Bumping is a reviewed code change — never tracks `main`. - **`kubectl wait --for=condition=established`** before the helm dry-run so the apiserver has registered the CRD when the chart template reaches the admission chain (avoids a race where the dry-run hits before discovery refreshes). ## Gatekeeper CRD timing Re-audited — `install_gatekeeper()` in `scripts/policy-matrix-smoke.sh` already polls `kubectl get crd ...constraints.gatekeeper.sh` (line 143-149) and the constraint `byPod[*].enforced` field (line 270-276) before the smoke step exits. The `kubectl get constraints -A || true` in the failure-collection step is diagnostic only and already tolerates absent CRDs. No timing fix needed there. ## Why not install-bench / chart.yml - `install-bench.yml` uses `bench/install/tracecore-values.yaml` which doesn't enable serviceMonitor — same failure shape doesn't apply. - `chart.yml`'s `install` and `upgrade` jobs install with default values (`serviceMonitor.enabled=false`); the `render` job's production-preset check is `helm template` only (no cluster), so no API discovery runs. ## Test plan - [x] `actionlint .github/workflows/policy-matrix.yml` — exit 0 - [x] `actionlint` across all `.github/workflows/` — exit 0 - [x] Pre-push hook suite passed locally (golangci-lint, vet, mod verify, attribute-namespace-check, zizmor, doc-check, alert-check, chart-appversion-check, rfc-status-check, slo-rules-check, deprecation-check, no-autoupdate-check, test-flake-audit) - [ ] policy-matrix workflow runs green on this PR — all 6 matrix rows (psa × default, psa × production, kyverno × default, kyverno × production, gatekeeper × default, gatekeeper × production) plus all 3 mutation rows. Closes #494. ```release-notes ci: install prometheus-operator ServiceMonitor CRD in policy-matrix kind cluster so chart-touching PRs no longer fail on the production preset's ServiceMonitor render ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>
Summary
Adds a kubelet-probe ingress rule to the chart's
NetworkPolicytemplate, closing M5b chart opportunistic #1 (docs/followups/M5b.md).Root cause. Kubelet liveness/readiness probes originate from the node IP via the host-network namespace. NetworkPolicy v1 cannot match host-network traffic with
namespaceSelectororpodSelectorpeers — only withipBlock. The existing chartNetworkPolicy(issue #301) carved ingress for in-namespace pods (scrape-in) but had no rule the kubelet matched. Result: anetworkPolicy.enabled: trueinstall would flip every DaemonSet podNotReadywithin onefailureThresholdwindow — the chart would render its own DaemonSet inoperable.Fix. New
networkPolicy.kubeletProbes.{enabled,cidr,except}block. When enabled (defaulttruewhen the policy is enabled), the template renders anipBlockingress rule on thehealthport (chart default:13133). Defaultcidr: 0.0.0.0/0is permissive on source IP but L4-scoped to the healthcheckextension port, so the telemetry + OTLP receiver ports stay locked down. Operators with a fixed node CIDR tighten it in their overlay.Production preset (
values-production.yaml) inherits the default-on posture. Schema (values.schema.json) extended withadditionalProperties: falseso typos fail athelm install.Cross-references
docs/followups/M5b.md— opportunistic-deferral list, item ci(deps): bump the gh-actions group with 5 updates #1 ticked.docs/threat-model.md§6.G — network-surface audit scope this template satisfies (listener inventory + default-deny verification).install/kubernetes/tracecore/README.md§security — operator-facing values walkthrough updated.#301(initial scrape-in + OTLP-out scope).Files changed
install/kubernetes/tracecore/templates/networkpolicy.yaml— newipBlockingress rule + load-bearing comment block explaining why0.0.0.0/0stays narrow.install/kubernetes/tracecore/values.yaml— newnetworkPolicy.kubeletProbesdefaults + comment.install/kubernetes/tracecore/values-production.yaml— inherits defaults explicitly with production-context comment.install/kubernetes/tracecore/values.schema.json— schema for the new block,additionalProperties: false.install/kubernetes/tracecore/README.md— three new values-table rows + updated NetworkPolicy section with threat-model cross-link.docs/followups/M5b.md— item ci(deps): bump the gh-actions group with 5 updates #1 ticked with implementation pointer.Test plan
helm lint install/kubernetes/tracecore— exit 0.helm lint install/kubernetes/tracecore -f values-production.yaml— exit 0.helm template install/kubernetes/tracecore— exit 0; NetworkPolicy NOT rendered (defaultenabled: false).helm template install/kubernetes/tracecore -f values-production.yaml— exit 0; NetworkPolicy rendered with kubelet-probe ingress rule.allowedEgressEndpoints— renders correctly (no DNS / probe rule loss).kubeletProbes.enabled: false— probe rule omitted; scrape-in rule unchanged.cidr: 10.0.0.0/16withexcept: [10.0.99.0/24]— rendersipBlock.cidr+ipBlock.exceptcorrectly.conftest test --policy policies/conftest/tracecore.regoon default render — 52/52 passed.conftest test --policy policies/conftest/tracecore.regoon production render — 91/91 passed.kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.30.0on default render — 4 valid, 0 invalid.kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.30.0on production render — 6 valid, 0 invalid, 1 skipped (ServiceMonitor CRD).Grade
A+ — root-cause fix, mutation-verified, conftest + kubeconform + helm-lint all clean, cross-linked to threat-model.md §6.G, explicit
policyTypes: [Ingress, Egress]deny-all baseline documented inline, M5b checklist item ticked.