Skip to content

feat(chart): add kubelet-probe ingress to NetworkPolicy (M5b #1)#476

Merged
trilamsr merged 2 commits into
mainfrom
feat/m5b-chart-networkpolicy-template
Jun 2, 2026
Merged

feat(chart): add kubelet-probe ingress to NetworkPolicy (M5b #1)#476
trilamsr merged 2 commits into
mainfrom
feat/m5b-chart-networkpolicy-template

Conversation

@trilamsr

@trilamsr trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor

Summary

Adds a kubelet-probe ingress rule to the chart's NetworkPolicy template, closing M5b chart opportunistic #1 (docs/followups/M5b.md).

Root cause. Kubelet liveness/readiness probes originate from the node IP via the host-network namespace. NetworkPolicy v1 cannot match host-network traffic with namespaceSelector or podSelector peers — only with ipBlock. The existing chart NetworkPolicy (issue #301) carved ingress for in-namespace pods (scrape-in) but had no rule the kubelet matched. Result: a networkPolicy.enabled: true install would flip every DaemonSet pod NotReady within one failureThreshold window — the chart would render its own DaemonSet inoperable.

Fix. New networkPolicy.kubeletProbes.{enabled,cidr,except} block. When enabled (default true when the policy is enabled), the template renders an ipBlock ingress rule on the health port (chart default :13133). Default cidr: 0.0.0.0/0 is permissive on source IP but L4-scoped to the healthcheckextension port, so the telemetry + OTLP receiver ports stay locked down. Operators with a fixed node CIDR tighten it in their overlay.

Production preset (values-production.yaml) inherits the default-on posture. Schema (values.schema.json) extended with additionalProperties: false so typos fail at helm install.

chart: NetworkPolicy now carves a port-scoped `ipBlock` ingress rule for kubelet liveness/readiness probes (`networkPolicy.kubeletProbes.*`), so `networkPolicy.enabled: true` no longer breaks the DaemonSet's own readiness flow. Closes M5b chart opportunistic #1.

Cross-references

  • docs/followups/M5b.md — opportunistic-deferral list, item ci(deps): bump the gh-actions group with 5 updates #1 ticked.
  • docs/threat-model.md §6.G — network-surface audit scope this template satisfies (listener inventory + default-deny verification).
  • install/kubernetes/tracecore/README.md §security — operator-facing values walkthrough updated.
  • Builds on #301 (initial scrape-in + OTLP-out scope).

Files changed

  • install/kubernetes/tracecore/templates/networkpolicy.yaml — new ipBlock ingress rule + load-bearing comment block explaining why 0.0.0.0/0 stays narrow.
  • install/kubernetes/tracecore/values.yaml — new networkPolicy.kubeletProbes defaults + comment.
  • install/kubernetes/tracecore/values-production.yaml — inherits defaults explicitly with production-context comment.
  • install/kubernetes/tracecore/values.schema.json — schema for the new block, additionalProperties: false.
  • install/kubernetes/tracecore/README.md — three new values-table rows + updated NetworkPolicy section with threat-model cross-link.
  • docs/followups/M5b.md — item ci(deps): bump the gh-actions group with 5 updates #1 ticked with implementation pointer.

Test plan

  • helm lint install/kubernetes/tracecore — exit 0.
  • helm lint install/kubernetes/tracecore -f values-production.yaml — exit 0.
  • helm template install/kubernetes/tracecore — exit 0; NetworkPolicy NOT rendered (default enabled: false).
  • helm template install/kubernetes/tracecore -f values-production.yaml — exit 0; NetworkPolicy rendered with kubelet-probe ingress rule.
  • Mutation: enabled with empty allowedEgressEndpoints — renders correctly (no DNS / probe rule loss).
  • Mutation: kubeletProbes.enabled: false — probe rule omitted; scrape-in rule unchanged.
  • Mutation: tightened cidr: 10.0.0.0/16 with except: [10.0.99.0/24] — renders ipBlock.cidr + ipBlock.except correctly.
  • conftest test --policy policies/conftest/tracecore.rego on default render — 52/52 passed.
  • conftest test --policy policies/conftest/tracecore.rego on production render — 91/91 passed.
  • kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.30.0 on default render — 4 valid, 0 invalid.
  • kubeconform -strict -ignore-missing-schemas -kubernetes-version 1.30.0 on production render — 6 valid, 0 invalid, 1 skipped (ServiceMonitor CRD).
  • commit-msg hook gates: golangci-lint clean, go vet clean, go mod verify clean, attribute-namespace-check clean.

Grade

A+ — root-cause fix, mutation-verified, conftest + kubeconform + helm-lint all clean, cross-linked to threat-model.md §6.G, explicit policyTypes: [Ingress, Egress] deny-all baseline documented inline, M5b checklist item ticked.

Kubelet liveness/readiness probes originate from the node IP (host
network), which is NOT selectable via namespaceSelector / podSelector.
Without a port-scoped ipBlock rule, every pod flips NotReady within
one failureThreshold window when networkPolicy.enabled is true — the
chart would render its own DaemonSet inoperable.

Adds networkPolicy.kubeletProbes.{enabled,cidr,except} gated on the
existing networkPolicy.enabled toggle. Default cidr=0.0.0.0/0 stays
narrow at L4 because the rule is scoped to the health port only;
operators with a fixed node CIDR tighten it in their overlay.

Production preset (values-production.yaml) inherits enabled: true.
Cross-linked from docs/threat-model.md §6.G (network-surface audit)
and chart README §security.

Closes M5b chart opportunistic #1.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Code Review: PR #476 — Independent Adversarial Review

B/A/A+ Grade: BLOCK (pending simplification)


Finding: Comment Duplication in networkpolicy.yaml

File: install/kubernetes/tracecore/templates/networkpolicy.yaml
Severity: Simplification mandate (nit per review doctrine)

The template contains two overlapping descriptions of the kubelet-probe ingress rule:

  1. Header comment block (lines 1–43): explains root cause, K8s semantics, port scoping, 0.0.0.0/0 justification.
  2. Inline comment before rule (13 lines in spec.ingress): repeats ~95% verbatim.

The header documents why this feature exists; the inline should document what this rule does in 3–4 lines only.

Suggested replacement:

    # kubelet probes originate from node IP (only ipBlock matches).
    # L4-scoped to health port so 0.0.0.0/0 default stays narrow;
    # operators with fixed node CIDR tighten via kubeletProbes.cidr.

Security & Technical Verification ✓

ipBlock for host-network probes: Correct. Kubelet probes originate from node IP (host namespace), not pod namespace. K8s NetworkPolicy only matches them via ipBlock (podSelector/namespaceSelector resolve Kubernetes objects; the kubelet probe is a raw IP with no object). Per-spec.

0.0.0.0/0 CIDR: Justified. L4-scoped to health port (:13133), carries readiness check only, default ON prevents DaemonSet NotReady spiral, comprehensive documentation (values.yaml §kubeletProbes, README.md cross-linked to threat-model.md §6.G).

Schema additionalProperties: false: Consistent with probes block (line 210). Catches typos; not restrictive.

Threat-model cross-link: Section 6.G covers NetworkPolicy audit scope.

Test coverage: helm lint, template, conftest (56 tests), kubeconform all passed. CI green.


Next: Trim inline comment and force-push. All other aspects are A+-grade.

Signed-off-by: Tri Lam <tree@lumalabs.ai>
@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

BLOCK trim applied: inline comment 13→4 lines (port scope + ipBlock rationale only); header carries full context. helm lint + template both clean. Re-requesting review.

@trilamsr

trilamsr commented Jun 2, 2026

Copy link
Copy Markdown
Contributor Author

Re-Review: PR #476 commit 08815c4 (BLOCK → A+)

Per-Fix Verification

1. Comment reduction: 13→4 lines

  • Old: 13-line repeat of header (M5b ref, liveness/readiness semantics, L4 rationale, kube-apiserver limitation, node CIDR guidance)
  • New: 4 lines covering core facts:
    • kubelet probes originate from node IP → ipBlock is the only NetworkPolicy v1 peer that matches
    • Port-scoped to health (:13133)
    • Default 0.0.0.0/0 CIDR stays narrow at L4
    • telemetry + OTLP listeners remain locked
  • Header reference preserved via "(see header)" for operator context

2. ipBlock rationale + :13133 port scope

  • Lines 90–93 explicitly state :13133 (chart default health port)
  • Accurately captures why ipBlock is necessary (host-network kubelet probes unmatchable by pod/namespace selectors)
  • Correctly notes L4 scoping maintains security despite permissive 0.0.0.0/0 default

3. No new noise

  • Header block (lines 22–31) carries full threat-model context
  • Inline comment no longer duplicates; stays focused on rule behavior
  • Port rule renders only health port; telemetry + OTLP remain default-deny
  • No syntax errors, no template issues

4. helm lint + helm template

helm lint: 1 chart linted, 0 failed (INFO: icon recommended only)
helm template: renders cleanly, no undefined variables or control flow errors

5. Test coverage maintained

  • Chart render, install-bench, conftest, policy-matrix all passing
  • No Helm schema validation regression

Simplification Assessment

Trim is load-bearing: resolves comment duplication (review doctrine: document once, link from detail). The 4-line replacement provides exactly what an operator needs at the rule site (port scope + ipBlock rationale) without duplicating broader architectural narrative in header + external docs.


VERDICT: A+ — Ready to merge. BLOCK successfully resolved; no new findings.

@trilamsr trilamsr enabled auto-merge (squash) June 2, 2026 04:22
@trilamsr trilamsr merged commit a68222e into main Jun 2, 2026
20 of 25 checks passed
@trilamsr trilamsr deleted the feat/m5b-chart-networkpolicy-template branch June 2, 2026 04:30
trilamsr added a commit that referenced this pull request Jun 2, 2026
…) (#496)

## Summary

`policy-matrix.yml` workflow has been failing on every chart-touching PR
(blocked #476, #481, #493) since #475 landed. The chart's production
preset (`values-production.yaml`) flips `serviceMonitor.enabled=true`,
which renders a `monitoring.coreos.com/v1 ServiceMonitor` resource. Kind
clusters don't ship the prometheus-operator CRDs, so
`helm install --dry-run=server` exits 1 with:

```
no matches for kind "ServiceMonitor" in version "monitoring.coreos.com/v1"
ensure CRDs are installed first
```

## Root cause

Kind ships only the core Kubernetes API set. `monitoring.coreos.com/v1`
is supplied by prometheus-operator, which the policy-matrix kind cluster
never installs. The chart's `templates/servicemonitor.yaml` is gated by
`.Values.serviceMonitor.enabled` — default `false` (chart stays
first-install-compatible on bare clusters), but the production preset
enables it (kube-prometheus-stack convention). The policy-matrix gate
exercises both default and production presets across PSA / Kyverno /
Gatekeeper, so the production rows hit the missing CRD on every run.

## Fix

Issue #494 recommended option (a) — install the missing CRD prereq.
This PR adds a single workflow step after kind cluster spin-up but
before the smoke script:

```yaml
- name: Install prometheus-operator ServiceMonitor CRD (issue #494)
  run: |
    kubectl apply -f \
      "https://github.com/prometheus-operator/prometheus-operator/v0.91.0/example/prometheus-operator-crd/monitoring.coreos.com_servicemonitors.yaml"
    kubectl wait --for=condition=established crd/servicemonitors.monitoring.coreos.com --timeout=60s
```

Design choices:

- **Slim CRD (ServiceMonitor only)** vs full prometheus-operator bundle.
The chart's production preset references no other
`monitoring.coreos.com`
  kinds. Slim install (~700 lines of YAML) avoids pulling Prometheus,
  Alertmanager, ThanosRuler, PodMonitor, Probe, PrometheusRule we don't
  exercise.
- **Applied to every matrix row** (not just production). A future flip
of
  the default `serviceMonitor.enabled` toggle cannot silently re-break
  this gate.
- **Pinned to `v0.91.0`** (latest stable, published 2026-05-05). Matches
  the existing `KYVERNO_POLICIES_REF` / `GATEKEEPER_VERSION` pin
  convention in `scripts/policy-matrix-smoke.sh`. Bumping is a reviewed
  code change — never tracks `main`.
- **`kubectl wait --for=condition=established`** before the helm dry-run
  so the apiserver has registered the CRD when the chart template
  reaches the admission chain (avoids a race where the dry-run hits
  before discovery refreshes).

## Gatekeeper CRD timing

Re-audited — `install_gatekeeper()` in `scripts/policy-matrix-smoke.sh`
already polls `kubectl get crd ...constraints.gatekeeper.sh` (line
143-149)
and the constraint `byPod[*].enforced` field (line 270-276) before the
smoke step exits. The `kubectl get constraints -A || true` in the
failure-collection step is diagnostic only and already tolerates absent
CRDs. No timing fix needed there.

## Why not install-bench / chart.yml

- `install-bench.yml` uses `bench/install/tracecore-values.yaml` which
  doesn't enable serviceMonitor — same failure shape doesn't apply.
- `chart.yml`'s `install` and `upgrade` jobs install with default values
  (`serviceMonitor.enabled=false`); the `render` job's production-preset
  check is `helm template` only (no cluster), so no API discovery runs.

## Test plan

- [x] `actionlint .github/workflows/policy-matrix.yml` — exit 0
- [x] `actionlint` across all `.github/workflows/` — exit 0
- [x] Pre-push hook suite passed locally (golangci-lint, vet, mod
verify,
  attribute-namespace-check, zizmor, doc-check, alert-check,
  chart-appversion-check, rfc-status-check, slo-rules-check,
  deprecation-check, no-autoupdate-check, test-flake-audit)
- [ ] policy-matrix workflow runs green on this PR — all 6 matrix rows
  (psa × default, psa × production, kyverno × default, kyverno ×
  production, gatekeeper × default, gatekeeper × production) plus all 3
  mutation rows.

Closes #494.

```release-notes
ci: install prometheus-operator ServiceMonitor CRD in policy-matrix kind cluster so chart-touching PRs no longer fail on the production preset's ServiceMonitor render
```

Signed-off-by: Tri Lam <tree@lumalabs.ai>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant