feat(chart): add kubelet-probe ingress to NetworkPolicy (M5b #1) by trilamsr · Pull Request #476 · TraceCoreAI/tracecore

trilamsr · 2026-06-02T04:09:25Z

Summary

Adds a kubelet-probe ingress rule to the chart's NetworkPolicy template, closing M5b chart opportunistic #1 (docs/followups/M5b.md).

Root cause. Kubelet liveness/readiness probes originate from the node IP via the host-network namespace. NetworkPolicy v1 cannot match host-network traffic with namespaceSelector or podSelector peers — only with ipBlock. The existing chart NetworkPolicy (issue #301) carved ingress for in-namespace pods (scrape-in) but had no rule the kubelet matched. Result: a networkPolicy.enabled: true install would flip every DaemonSet pod NotReady within one failureThreshold window — the chart would render its own DaemonSet inoperable.

Fix. New networkPolicy.kubeletProbes.{enabled,cidr,except} block. When enabled (default true when the policy is enabled), the template renders an ipBlock ingress rule on the health port (chart default :13133). Default cidr: 0.0.0.0/0 is permissive on source IP but L4-scoped to the healthcheckextension port, so the telemetry + OTLP receiver ports stay locked down. Operators with a fixed node CIDR tighten it in their overlay.

Production preset (values-production.yaml) inherits the default-on posture. Schema (values.schema.json) extended with additionalProperties: false so typos fail at helm install.

chart: NetworkPolicy now carves a port-scoped `ipBlock` ingress rule for kubelet liveness/readiness probes (`networkPolicy.kubeletProbes.*`), so `networkPolicy.enabled: true` no longer breaks the DaemonSet's own readiness flow. Closes M5b chart opportunistic #1.

Cross-references

docs/followups/M5b.md — opportunistic-deferral list, item ci(deps): bump the gh-actions group with 5 updates #1 ticked.
docs/threat-model.md §6.G — network-surface audit scope this template satisfies (listener inventory + default-deny verification).
install/kubernetes/tracecore/README.md §security — operator-facing values walkthrough updated.
Builds on #301 (initial scrape-in + OTLP-out scope).

Files changed

install/kubernetes/tracecore/templates/networkpolicy.yaml — new ipBlock ingress rule + load-bearing comment block explaining why 0.0.0.0/0 stays narrow.
install/kubernetes/tracecore/values.yaml — new networkPolicy.kubeletProbes defaults + comment.
install/kubernetes/tracecore/values-production.yaml — inherits defaults explicitly with production-context comment.
install/kubernetes/tracecore/values.schema.json — schema for the new block, additionalProperties: false.
install/kubernetes/tracecore/README.md — three new values-table rows + updated NetworkPolicy section with threat-model cross-link.
docs/followups/M5b.md — item ci(deps): bump the gh-actions group with 5 updates #1 ticked with implementation pointer.

Test plan

Grade

A+ — root-cause fix, mutation-verified, conftest + kubeconform + helm-lint all clean, cross-linked to threat-model.md §6.G, explicit policyTypes: [Ingress, Egress] deny-all baseline documented inline, M5b checklist item ticked.

Kubelet liveness/readiness probes originate from the node IP (host network), which is NOT selectable via namespaceSelector / podSelector. Without a port-scoped ipBlock rule, every pod flips NotReady within one failureThreshold window when networkPolicy.enabled is true — the chart would render its own DaemonSet inoperable. Adds networkPolicy.kubeletProbes.{enabled,cidr,except} gated on the existing networkPolicy.enabled toggle. Default cidr=0.0.0.0/0 stays narrow at L4 because the rule is scoped to the health port only; operators with a fixed node CIDR tighten it in their overlay. Production preset (values-production.yaml) inherits enabled: true. Cross-linked from docs/threat-model.md §6.G (network-surface audit) and chart README §security. Closes M5b chart opportunistic #1. Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr · 2026-06-02T04:14:59Z

Code Review: PR #476 — Independent Adversarial Review

B/A/A+ Grade: BLOCK (pending simplification)

Finding: Comment Duplication in networkpolicy.yaml

File: install/kubernetes/tracecore/templates/networkpolicy.yaml
Severity: Simplification mandate (nit per review doctrine)

The template contains two overlapping descriptions of the kubelet-probe ingress rule:

Header comment block (lines 1–43): explains root cause, K8s semantics, port scoping, 0.0.0.0/0 justification.
Inline comment before rule (13 lines in spec.ingress): repeats ~95% verbatim.

The header documents why this feature exists; the inline should document what this rule does in 3–4 lines only.

Suggested replacement:

    # kubelet probes originate from node IP (only ipBlock matches).
    # L4-scoped to health port so 0.0.0.0/0 default stays narrow;
    # operators with fixed node CIDR tighten via kubeletProbes.cidr.

Security & Technical Verification ✓

ipBlock for host-network probes: Correct. Kubelet probes originate from node IP (host namespace), not pod namespace. K8s NetworkPolicy only matches them via ipBlock (podSelector/namespaceSelector resolve Kubernetes objects; the kubelet probe is a raw IP with no object). Per-spec.

0.0.0.0/0 CIDR: Justified. L4-scoped to health port (:13133), carries readiness check only, default ON prevents DaemonSet NotReady spiral, comprehensive documentation (values.yaml §kubeletProbes, README.md cross-linked to threat-model.md §6.G).

Schema additionalProperties: false: Consistent with probes block (line 210). Catches typos; not restrictive.

Threat-model cross-link: Section 6.G covers NetworkPolicy audit scope.

Test coverage: helm lint, template, conftest (56 tests), kubeconform all passed. CI green.

Next: Trim inline comment and force-push. All other aspects are A+-grade.

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr · 2026-06-02T04:20:12Z

BLOCK trim applied: inline comment 13→4 lines (port scope + ipBlock rationale only); header carries full context. helm lint + template both clean. Re-requesting review.

trilamsr · 2026-06-02T04:22:24Z

Re-Review: PR #476 commit `08815c4` (BLOCK → A+)

Per-Fix Verification

1. Comment reduction: 13→4 lines ✓

Old: 13-line repeat of header (M5b ref, liveness/readiness semantics, L4 rationale, kube-apiserver limitation, node CIDR guidance)
New: 4 lines covering core facts:
- kubelet probes originate from node IP → ipBlock is the only NetworkPolicy v1 peer that matches
- Port-scoped to health (:13133)
- Default 0.0.0.0/0 CIDR stays narrow at L4
- telemetry + OTLP listeners remain locked
Header reference preserved via "(see header)" for operator context

2. ipBlock rationale + :13133 port scope ✓

Lines 90–93 explicitly state :13133 (chart default health port)
Accurately captures why ipBlock is necessary (host-network kubelet probes unmatchable by pod/namespace selectors)
Correctly notes L4 scoping maintains security despite permissive 0.0.0.0/0 default

3. No new noise ✓

Header block (lines 22–31) carries full threat-model context
Inline comment no longer duplicates; stays focused on rule behavior
Port rule renders only health port; telemetry + OTLP remain default-deny
No syntax errors, no template issues

4. helm lint + helm template ✓

helm lint: 1 chart linted, 0 failed (INFO: icon recommended only)
helm template: renders cleanly, no undefined variables or control flow errors

5. Test coverage maintained ✓

Chart render, install-bench, conftest, policy-matrix all passing
No Helm schema validation regression

Simplification Assessment

Trim is load-bearing: resolves comment duplication (review doctrine: document once, link from detail). The 4-line replacement provides exactly what an operator needs at the rule site (port scope + ipBlock rationale) without duplicating broader architectural narrative in header + external docs.

VERDICT: A+ — Ready to merge. BLOCK successfully resolved; no new findings.

…) (#496) ## Summary `policy-matrix.yml` workflow has been failing on every chart-touching PR (blocked #476, #481, #493) since #475 landed. The chart's production preset (`values-production.yaml`) flips `serviceMonitor.enabled=true`, which renders a `monitoring.coreos.com/v1 ServiceMonitor` resource. Kind clusters don't ship the prometheus-operator CRDs, so `helm install --dry-run=server` exits 1 with: ``` no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1" ensure CRDs are installed first ``` ## Root cause Kind ships only the core Kubernetes API set. `monitoring.coreos.com/v1` is supplied by prometheus-operator, which the policy-matrix kind cluster never installs. The chart's `templates/servicemonitor.yaml` is gated by `.Values.serviceMonitor.enabled` — default `false` (chart stays first-install-compatible on bare clusters), but the production preset enables it (kube-prometheus-stack convention). The policy-matrix gate exercises both default and production presets across PSA / Kyverno / Gatekeeper, so the production rows hit the missing CRD on every run. ## Fix Issue #494 recommended option (a) — install the missing CRD prereq. This PR adds a single workflow step after kind cluster spin-up but before the smoke script: ```yaml - name: Install prometheus-operator ServiceMonitor CRD (issue #494) run: | kubectl apply -f \ "https://github.com/prometheus-operator/prometheus-operator/v0.91.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml" kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s ``` Design choices: - **Slim CRD (ServiceMonitor only)** vs full prometheus-operator bundle. The chart's production preset references no other `monitoring.coreos.com` kinds. Slim install (~700 lines of YAML) avoids pulling Prometheus, Alertmanager, ThanosRuler, PodMonitor, Probe, PrometheusRule we don't exercise. - **Applied to every matrix row** (not just production). A future flip of the default `serviceMonitor.enabled` toggle cannot silently re-break this gate. - **Pinned to `v0.91.0`** (latest stable, published 2026-05-05). Matches the existing `KYVERNO_POLICIES_REF` / `GATEKEEPER_VERSION` pin convention in `scripts/policy-matrix-smoke.sh`. Bumping is a reviewed code change — never tracks `main`. - **`kubectl wait --for=condition=established`** before the helm dry-run so the apiserver has registered the CRD when the chart template reaches the admission chain (avoids a race where the dry-run hits before discovery refreshes). ## Gatekeeper CRD timing Re-audited — `install_gatekeeper()` in `scripts/policy-matrix-smoke.sh` already polls `kubectl get crd ...constraints.gatekeeper.sh` (line 143-149) and the constraint `byPod[*].enforced` field (line 270-276) before the smoke step exits. The `kubectl get constraints -A || true` in the failure-collection step is diagnostic only and already tolerates absent CRDs. No timing fix needed there. ## Why not install-bench / chart.yml - `install-bench.yml` uses `bench/install/tracecore-values.yaml` which doesn't enable serviceMonitor — same failure shape doesn't apply. - `chart.yml`'s `install` and `upgrade` jobs install with default values (`serviceMonitor.enabled=false`); the `render` job's production-preset check is `helm template` only (no cluster), so no API discovery runs. ## Test plan - [x] `actionlint .github/workflows/policy-matrix.yml` — exit 0 - [x] `actionlint` across all `.github/workflows/` — exit 0 - [x] Pre-push hook suite passed locally (golangci-lint, vet, mod verify, attribute-namespace-check, zizmor, doc-check, alert-check, chart-appversion-check, rfc-status-check, slo-rules-check, deprecation-check, no-autoupdate-check, test-flake-audit) - [ ] policy-matrix workflow runs green on this PR — all 6 matrix rows (psa × default, psa × production, kyverno × default, kyverno × production, gatekeeper × default, gatekeeper × production) plus all 3 mutation rows. Closes #494. ```release-notes ci: install prometheus-operator ServiceMonitor CRD in policy-matrix kind cluster so chart-touching PRs no longer fail on the production preset's ServiceMonitor render ``` Signed-off-by: Tri Lam <tree@lumalabs.ai>

refactor(m5b): trim duplicate kubelet-probe comment

08815c4

Signed-off-by: Tri Lam <tree@lumalabs.ai>

trilamsr enabled auto-merge (squash) June 2, 2026 04:22

trilamsr merged commit a68222e into main Jun 2, 2026
20 of 25 checks passed

trilamsr deleted the feat/m5b-chart-networkpolicy-template branch June 2, 2026 04:30

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

feat(chart): add kubelet-probe ingress to NetworkPolicy (M5b #1)#476

feat(chart): add kubelet-probe ingress to NetworkPolicy (M5b #1)#476
trilamsr merged 2 commits into
mainfrom
feat/m5b-chart-networkpolicy-template

trilamsr commented Jun 2, 2026

Uh oh!

trilamsr commented Jun 2, 2026

Uh oh!

trilamsr commented Jun 2, 2026

Uh oh!

trilamsr commented Jun 2, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Uh oh!

Conversation

trilamsr commented Jun 2, 2026

Summary

Cross-references

Files changed

Test plan

Grade

Uh oh!

trilamsr commented Jun 2, 2026

Code Review: PR #476 — Independent Adversarial Review

B/A/A+ Grade: BLOCK (pending simplification)

Finding: Comment Duplication in networkpolicy.yaml

Security & Technical Verification ✓

Uh oh!

trilamsr commented Jun 2, 2026

Uh oh!

trilamsr commented Jun 2, 2026

Re-Review: PR #476 commit 08815c4 (BLOCK → A+)

Per-Fix Verification

Simplification Assessment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Re-Review: PR #476 commit `08815c4` (BLOCK → A+)