diff --git a/CHANGELOG.md b/CHANGELOG.md index 369cf23f..e58c8fe0 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -9,7 +9,9 @@ Pre-alpha. **Distribution-first pivot adopted ([RFC-0013](docs/rfcs/0013-distro- Pivot landed across three waves of PRs: - Wave 1 (#166 RFC doc accepted, #168 delete kueue + kineto receivers, #169 pre-PR-A drift sweep + Helm security tighten, #170 containerstdout deletion explicit in §7, #171 PR-A OCB skeleton + `builder-config.yaml` + `make build-ocb`, #172 dedup gate execution, #173 rename check tiers + add PR body-artifact guard, #174 PR-C release pipeline → goreleaser stack + RFC supersession + top-level doc alignment, #175 wave-1 self-review fixes + delete archive folder). - Wave 2 (#176 PR-D image build → ko + `_build/` walker fix + PR-B reframe as side-effect of binary swap, #177 build-ocb CI gate, #178 post-wave-2 drift sweep, #179 v0.1→v0.2 migration guide skeleton). -- Wave 3 (PR-E: bench heartbeat swap `clockreceiver` → `hostmetricsreceiver`). +- Wave 3 (PR-E: bench heartbeat swap `clockreceiver` → `hostmetricsreceiver`; PR-J: ship the four receiver-side recipes that replace the deleted in-tree receivers — filelog+container, journald+filelog+OTTL, k8sobjects+transform, prometheusreceiver). 
+ +**PR-J landed: four receiver-side integration recipes for the v0.2.0 swap.** New docs ship under `docs/integrations/`: [`filelog-container.md`](docs/integrations/filelog-container.md) (replaces `containerstdout` — `filelogreceiver` with the container parser stanza, `k8sattributesprocessor`, and `file_storage` for restart-safe checkpoints), [`journald-kernel.md`](docs/integrations/journald-kernel.md) (replaces `kernelevents` — `journaldreceiver` + `filelogreceiver` on `/dev/kmsg` + OTTL `transform` that preserves the customer-stable `kernelevents.xid` and `gpu.id` attributes from RFC-0013 §3), [`k8sobjects-events.md`](docs/integrations/k8sobjects-events.md) (replaces `k8sevents` — `k8sobjectsreceiver` watch mode + OTTL `transform` that derives the eleven-entry `k8s.event.hint` enum), and [`prometheus-scrape.md`](docs/integrations/prometheus-scrape.md) (replaces `dcgm` + `kueue` — generic `prometheusreceiver` scrape with the four GPU vendor exporters tabulated and an OTTL stamp of the `gpu.vendor` resource attribute). Every recipe ships a matching `docs/integrations/examples/*.yaml` validated end-to-end by `make validator-recipe` against the OCB-built `./_build/tracecore validate`. The k8sobjects recipe introduces a new `` marker recognized by both `scripts/doc-check.sh` (accepted) and `scripts/validator-recipe.sh` (skipped with a named log line) because the upstream `k8sobjectsreceiver`'s `Validate()` enumerates server-preferred resources via the discovery client and therefore cannot be exercised offline — its example is gated by the kind-cluster job that runs the chart. Updates `docs/migration/v0.1-to-v0.2.md` to flip the PR-J open-item to done with file pointers. CHANGELOG only — no operator-visible runtime change; v0.2.0 release still gates on PR-K (in-tree-receiver deletion) and PR-L (final migration guide body). **PR-E unblocked.** Original RFC-0013 §migration plan named `telemetrygeneratorreceiver` as the upstream replacement for `clockreceiver`. 
Verified 2026-05-30: the receiver does not exist in `opentelemetry-collector-contrib` at any tag from v0.95.0 through v0.130.0; two community proposals (contrib issues #41687 and #43657) were closed `not_planned`. Replacement landed on `hostmetricsreceiver` (loadscraper @ 1s) — an upstream OCB-bundled receiver that emits 3 low-cardinality series (`system.cpu.load_average.{1m,5m,15m}`) at the cadence the bench's pass condition needs (first parseable JSON line at the sink — see `bench/install/run.sh`). This PR adds `hostmetricsreceiver` to `builder-config.yaml`, adds a `receivers.hostmetrics` opt-in block to the chart values (default disabled — chart default stays `clockreceiver` this release), and flips `bench/install/tracecore-values.yaml` to enable hostmetrics + disable clockreceiver. RFC-0013 §migration PR-E + §4 + §7 deletion table updated. Chart-default flip from `clockreceiver` to `hostmetrics` + source-deletion of `components/receivers/clockreceiver/` are deferred to PR-K (in-tree-receiver deletion wave) so the values-keys migration ships together with `NOTES.txt` deprecation warnings and the coordinated migration of ~92 in-tree test-fixture references in one cut rather than two operator-visible changes. diff --git a/docs/README.md b/docs/README.md index 78b8b4b5..5f4ecbf9 100644 --- a/docs/README.md +++ b/docs/README.md @@ -40,6 +40,8 @@ Legend: 👤 operator · 🛠️ contributor · 🏛️ maintainer · 🌐 exter ## Integrations +Backend (exporter-side) recipes: + | File | Audience | Purpose | |---|---|---| | [integrations/otel-backend.md](integrations/otel-backend.md) | 👤 | OTLP/HTTP to a generic OpenTelemetry Collector via the in-tree `otlphttp` exporter. | @@ -47,6 +49,15 @@ Legend: 👤 operator · 🛠️ contributor · 🏛️ maintainer · 🌐 exter | [integrations/datadog.md](integrations/datadog.md) | 👤 | Datadog via the bundled `datadogexporter`. 
| | [integrations/clickhouse-direct.md](integrations/clickhouse-direct.md) | 👤 | Self-hosted ClickHouse via the bundled `clickhouseexporter`. | +Source (receiver-side) recipes — RFC-0013 §migration PR-J replacements for the deleted in-tree receivers: + +| File | Audience | Purpose | +|---|---|---| +| [integrations/filelog-container.md](integrations/filelog-container.md) | 👤 | Container stdout/stderr tailing via `filelogreceiver` + container parser + `k8sattributesprocessor` + `file_storage`. Replaces `containerstdout`. | +| [integrations/journald-kernel.md](integrations/journald-kernel.md) | 👤 | Kernel + systemd events via `journaldreceiver` + `filelogreceiver` (kmsg) + OTTL transform preserving `kernelevents.xid` / `gpu.id`. Replaces `kernelevents`. | +| [integrations/k8sobjects-events.md](integrations/k8sobjects-events.md) | 👤 | Kubernetes events via `k8sobjectsreceiver` + OTTL transform preserving the eleven-entry `k8s.event.hint` enum. Replaces `k8sevents`. | +| [integrations/prometheus-scrape.md](integrations/prometheus-scrape.md) | 👤 | Generic Prometheus scrape via `prometheusreceiver` (dcgm-exporter, AMD/Intel/Habana exporters, Kueue) + OTTL `gpu.vendor` normalization. Replaces `dcgm` and `kueue`. | + ## Per-component docs | Path | Audience | Purpose | diff --git a/docs/integrations/examples/filelog-container.yaml b/docs/integrations/examples/filelog-container.yaml new file mode 100644 index 00000000..a29d7933 --- /dev/null +++ b/docs/integrations/examples/filelog-container.yaml @@ -0,0 +1,94 @@ +# Container stdout/stderr tailing via the upstream `filelogreceiver` +# with a container parser stanza, k8sattributes enrichment, and the +# `file_storage` extension for restart-safe checkpointing. Replaces +# the in-tree `containerstdout` receiver scheduled for deletion at +# v0.2.0 per RFC-0013 §migration PR-K + §7. The OCB-assembled +# tracecore binary bundles every component below; `tracecore +# validate` covers this file. 
+# +# Deployment shape: tracecore DaemonSet -> reads /var/log/pods/*/* on +# each node -> filelogreceiver parses CRI/JSON -> k8sattributesprocessor +# enriches with pod/namespace/labels -> otlphttpexporter to backend. +# Replace the OTLP endpoint placeholder at deploy time; tracecore does +# not expand environment variables in YAML, so render the literal with +# a secret-injection tool (envsubst, External Secrets, sealed-secrets) +# before `helm install`. See docs/integrations/filelog-container.md. + +extensions: + file_storage/checkpoints: + directory: /var/lib/tracecore/filelog + create_directory: true + timeout: 1s + compaction: + directory: /var/lib/tracecore/filelog + on_start: true + on_rebound: true + +receivers: + filelog/container: + include: + - /var/log/pods/*/*/*.log + exclude: + - /var/log/pods/*/tracecore/*.log + start_at: end + include_file_path: true + include_file_name: false + storage: file_storage/checkpoints + operators: + - id: container-parser + type: container + format: auto + add_metadata_from_filepath: true + - id: severity-parser + type: severity_parser + parse_from: attributes.stream + mapping: + error: stderr + info: stdout + if: 'attributes.stream != nil' + +processors: + k8sattributes: + auth_type: serviceAccount + passthrough: false + extract: + metadata: + - k8s.namespace.name + - k8s.pod.name + - k8s.pod.uid + - k8s.deployment.name + - k8s.statefulset.name + - k8s.daemonset.name + - k8s.node.name + - k8s.container.name + labels: + - tag_name: app + key: app.kubernetes.io/name + from: pod + pod_association: + - sources: + - from: resource_attribute + name: k8s.pod.uid + - sources: + - from: resource_attribute + name: k8s.namespace.name + - from: resource_attribute + name: k8s.pod.name + batch: + send_batch_size: 8192 + timeout: 5s + send_batch_max_size: 16384 + +exporters: + otlphttp: + endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT + compression: gzip + timeout: 10s + +service: + extensions: [file_storage/checkpoints] + pipelines: + 
logs/container: + receivers: [filelog/container] + processors: [k8sattributes, batch] + exporters: [otlphttp] diff --git a/docs/integrations/examples/journald-kernel.yaml b/docs/integrations/examples/journald-kernel.yaml new file mode 100644 index 00000000..d9f8837e --- /dev/null +++ b/docs/integrations/examples/journald-kernel.yaml @@ -0,0 +1,100 @@ +# Kernel + systemd events via upstream `journaldreceiver` (journald +# stream) + `filelogreceiver` (raw /dev/kmsg for kernel ring buffer), +# normalized to OTel log records via OTTL `transform` so operator +# alerts against `kernelevents.xid` survive the swap per RFC-0013 +# §3 (customer-stable contracts). Replaces the in-tree +# `kernelevents` receiver scheduled for deletion at v0.2.0 per +# RFC-0013 §migration PR-K + §7. The OCB-assembled tracecore binary +# bundles every component below; `tracecore validate` covers this +# file. +# +# Deployment shape: tracecore DaemonSet -> reads /var/log/journal +# + /dev/kmsg on each node (hostPath mounts) -> OTTL transform +# normalizes severity + extracts NVRM Xid -> otlphttpexporter to +# backend. Replace the OTLP endpoint placeholder at deploy time; +# tracecore does not expand environment variables in YAML, so render +# the literal with a secret-injection tool (envsubst, External +# Secrets, sealed-secrets) before `helm install`. See +# docs/integrations/journald-kernel.md. + +receivers: + journald: + directory: /var/log/journal + units: + - kubelet.service + - containerd.service + - systemd-networkd.service + priority: info + filelog/kmsg: + include: + - /dev/kmsg + start_at: end + include_file_path: false + operators: + - id: kmsg-parser + type: regex_parser + regex: '^(?P\d+),(?P\d+),(?P\d+),(?P\w+);(?P.*)$' + # Map syslog numeric priority (0-7) to OTel severity. The + # numeric values are quoted because severity_parser's + # mapping accepts string keys. 
+ severity: + parse_from: attributes.priority + mapping: + error: "3" + warn: "4" + info: "6" + debug: "7" + +processors: + # Kmsg lines arrive with body=string (the raw `/dev/kmsg` line) + # so IsMatch / ExtractPatterns on body type-check correctly. The + # NVRM Xid signal only appears here, never in journald output. + transform/kmsg_xid: + log_statements: + - context: log + # OTTL statements are single-quoted to prevent YAML from + # interpreting embedded `:` (inside regex character classes) + # as a map-key separator. Without the quotes the parser + # rejects the value as `type=string cannot be used as a + # Conf` at validate time. + statements: + # NVRM Xid extraction preserves the customer-stable + # `kernelevents.xid` attribute across the receiver swap. + - 'set(attributes["kernelevents.xid"], Int(ExtractPatterns(body, "NVRM: Xid \\(PCI:[0-9a-fA-F:.]+\\): (?P<xid>\\d+)")["xid"])) where IsMatch(body, "NVRM: Xid")' + # gpu.id (PCI BDF) extracted from the Xid line where + # present; matches §3 customer-stable contract. + - 'set(attributes["gpu.id"], ExtractPatterns(body, "NVRM: Xid \\((?P<bdf>PCI:[0-9a-fA-F:.]+)\\)")["bdf"]) where IsMatch(body, "NVRM: Xid")' + # Journald records arrive with body=map (every journald field + # keyed verbatim, e.g. body["_SYSTEMD_UNIT"]). Lifting the unit + # name to the standard `service.name` resource attribute lets + # downstream filters speak the OTel convention instead of the + # journald-specific underscored names. + transform/journald_service_name: + log_statements: + - context: log + statements: + - 'set(attributes["service.name"], body["_SYSTEMD_UNIT"]) where body["_SYSTEMD_UNIT"] != nil' + batch: + send_batch_size: 4096 + timeout: 5s + +exporters: + otlphttp: + endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT + compression: gzip + timeout: 10s + +service: + pipelines: + # Separate pipelines per receiver so each OTTL transform sees + # the body shape it was written against (string for kmsg, map + # for journald). 
A single shared pipeline would force the NVRM + # Xid IsMatch to run against a map and runtime-error. + logs/kmsg: + receivers: [filelog/kmsg] + processors: [transform/kmsg_xid, batch] + exporters: [otlphttp] + logs/journald: + receivers: [journald] + processors: [transform/journald_service_name, batch] + exporters: [otlphttp] diff --git a/docs/integrations/examples/k8sobjects-events.yaml b/docs/integrations/examples/k8sobjects-events.yaml new file mode 100644 index 00000000..ed76e248 --- /dev/null +++ b/docs/integrations/examples/k8sobjects-events.yaml @@ -0,0 +1,64 @@ +# Kubernetes Events API via upstream `k8sobjectsreceiver` in watch +# mode, normalized through OTTL `transform` to populate the +# customer-stable `k8s.event.hint` attribute (RFC-0013 §3 - the +# 11-entry enum pod_evicted / mount_failure / backoff / oom_killed / +# node_unhealthy / schedule_failure / create_failure / +# volume_attach_failure / container_status_unknown / node_pressure / +# image_pull_failure). Replaces the in-tree `k8sevents` receiver +# scheduled for deletion at v0.2.0 per RFC-0013 §migration PR-K + +# §7. The OCB-assembled tracecore binary bundles every component +# below; `tracecore validate` covers this file. +# +# Deployment shape: tracecore Deployment (single replica, not +# DaemonSet - one watcher per cluster) -> watches core/v1 Events -> +# OTTL transform derives `k8s.event.hint` from event.reason -> +# otlphttpexporter to backend. Replace the OTLP endpoint placeholder +# at deploy time; tracecore does not expand environment variables in +# YAML, so render the literal with a secret-injection tool +# (envsubst, External Secrets, sealed-secrets) before `helm +# install`. See docs/integrations/k8sobjects-events.md. 
+ +receivers: + k8sobjects: + auth_type: serviceAccount + objects: + - name: events + mode: watch + group: "" + +processors: + # Derive the 11-entry `k8s.event.hint` enum from the Kubernetes + # Event.reason field so operator alerts against §3's customer-stable + # contract survive the receiver swap. Reason values come from + # kubernetes/kubernetes/pkg/kubelet/events/event.go. + transform/hint: + log_statements: + - context: log + statements: + - set(attributes["k8s.event.hint"], "pod_evicted") where body["object"]["reason"] == "Evicted" + - set(attributes["k8s.event.hint"], "oom_killed") where body["object"]["reason"] == "OOMKilling" + - set(attributes["k8s.event.hint"], "backoff") where body["object"]["reason"] == "BackOff" + - set(attributes["k8s.event.hint"], "create_failure") where body["object"]["reason"] == "FailedCreatePodSandBox" or body["object"]["reason"] == "FailedCreate" + - set(attributes["k8s.event.hint"], "schedule_failure") where body["object"]["reason"] == "FailedScheduling" + - set(attributes["k8s.event.hint"], "mount_failure") where body["object"]["reason"] == "FailedMount" + - set(attributes["k8s.event.hint"], "volume_attach_failure") where body["object"]["reason"] == "FailedAttachVolume" + - set(attributes["k8s.event.hint"], "image_pull_failure") where body["object"]["reason"] == "Failed" or body["object"]["reason"] == "ErrImagePull" or body["object"]["reason"] == "ImagePullBackOff" + - set(attributes["k8s.event.hint"], "container_status_unknown") where body["object"]["reason"] == "ContainerStatusUnknown" + - set(attributes["k8s.event.hint"], "node_pressure") where body["object"]["reason"] == "EvictionThresholdMet" or body["object"]["reason"] == "NodeHasInsufficientMemory" or body["object"]["reason"] == "NodeHasDiskPressure" or body["object"]["reason"] == "NodeHasInsufficientPID" + - set(attributes["k8s.event.hint"], "node_unhealthy") where body["object"]["reason"] == "NodeNotReady" or body["object"]["reason"] == "NodeNotSchedulable" + batch: 
+ send_batch_size: 1024 + timeout: 10s + +exporters: + otlphttp: + endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT + compression: gzip + timeout: 10s + +service: + pipelines: + logs/k8sevents: + receivers: [k8sobjects] + processors: [transform/hint, batch] + exporters: [otlphttp] diff --git a/docs/integrations/examples/prometheus-scrape.yaml b/docs/integrations/examples/prometheus-scrape.yaml new file mode 100644 index 00000000..2c4778cc --- /dev/null +++ b/docs/integrations/examples/prometheus-scrape.yaml @@ -0,0 +1,67 @@ +# Generic Prometheus scrape via upstream `prometheusreceiver` - +# the adoption shape for every vendor GPU exporter per RFC-0013 §2 +# (NVIDIA `dcgm-exporter`, AMD `ROCm/device-metrics-exporter`, +# Intel `intel/xpumanager`, Habana Prometheus Metric Exporter) and +# for Kueue scheduler metrics. Replaces the in-tree `dcgm` and +# `kueue` receivers (deleted at v0.1.0 per RFC-0013 §7 deletion +# table). The OCB-assembled tracecore binary bundles every +# component below; `tracecore validate` covers this file. +# +# Deployment shape: tracecore DaemonSet (for per-node scrape +# targets like dcgm-exporter) or Deployment (for cluster-scoped +# targets like the Kueue control-plane) -> Prometheus-style scrape +# at the configured interval -> OTTL transform normalizes +# customer-stable resource attributes (`gpu.vendor`, `gpu.id`) -> +# otlphttpexporter to backend. The example below scrapes a +# dcgm-exporter DaemonSet at the conventional :9400/metrics +# endpoint. Replace the OTLP endpoint and dcgm-exporter target +# placeholders at deploy time; tracecore does not expand +# environment variables in YAML, so render the literals with a +# secret-injection tool (envsubst, External Secrets, +# sealed-secrets) before `helm install`. See +# docs/integrations/prometheus-scrape.md. 
+ +receivers: + prometheus: + config: + scrape_configs: + - job_name: dcgm-exporter + scrape_interval: 15s + scrape_timeout: 10s + metrics_path: /metrics + static_configs: + - targets: + - REPLACE_WITH_DCGM_EXPORTER_TARGET + # Add bearer_token / tls_config blocks here for + # authenticated targets such as Kueue's controller-manager + # metrics endpoint; see the recipe markdown for the full + # field list. + +processors: + # Normalize the cross-vendor customer-stable attributes from + # RFC-0013 §3 so dashboards survive a future swap from + # dcgm-exporter to AMD/Intel/Habana equivalents. + transform/gpu_vendor: + metric_statements: + - context: datapoint + statements: + - set(resource.attributes["gpu.vendor"], "nvidia") where IsMatch(metric.name, "^DCGM_") + - set(resource.attributes["gpu.vendor"], "amd") where IsMatch(metric.name, "^amdsmi_") + - set(resource.attributes["gpu.vendor"], "intel") where IsMatch(metric.name, "^xpum_") + - set(resource.attributes["gpu.vendor"], "habana") where IsMatch(metric.name, "^habanalabs_") + batch: + send_batch_size: 8192 + timeout: 10s + +exporters: + otlphttp: + endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT + compression: gzip + timeout: 10s + +service: + pipelines: + metrics/scrape: + receivers: [prometheus] + processors: [transform/gpu_vendor, batch] + exporters: [otlphttp] diff --git a/docs/integrations/filelog-container.md b/docs/integrations/filelog-container.md new file mode 100644 index 00000000..10b69725 --- /dev/null +++ b/docs/integrations/filelog-container.md @@ -0,0 +1,158 @@ + + + +# Container stdout via `filelogreceiver` + container parser + +Tracecore tails container stdout/stderr files under `/var/log/pods/` +on each node using the upstream `filelogreceiver` with a container +parser stanza. The `k8sattributesprocessor` enriches each record with +pod, namespace, and workload identity; the `file_storage` extension +checkpoints read offsets across restarts so log lines are not +re-shipped on rollouts. 
Replaces the in-tree `containerstdout` +receiver scheduled for deletion at v0.2.0 per +[RFC-0013 §migration PR-K](../rfcs/0013-distro-first-pivot.md#migration--rollout) +and §7 (Deletion list). + +## Config + +```yaml +# docs/integrations/examples/filelog-container.yaml +extensions: + file_storage/checkpoints: + directory: /var/lib/tracecore/filelog + create_directory: true + timeout: 1s + compaction: + directory: /var/lib/tracecore/filelog + on_start: true + on_rebound: true + +receivers: + filelog/container: + include: + - /var/log/pods/*/*/*.log + exclude: + - /var/log/pods/*/tracecore/*.log + start_at: end + include_file_path: true + include_file_name: false + storage: file_storage/checkpoints + operators: + - id: container-parser + type: container + format: auto + add_metadata_from_filepath: true + - id: severity-parser + type: severity_parser + parse_from: attributes.stream + mapping: + error: stderr + info: stdout + if: 'attributes.stream != nil' + +processors: + k8sattributes: + auth_type: serviceAccount + passthrough: false + extract: + metadata: + - k8s.namespace.name + - k8s.pod.name + - k8s.pod.uid + - k8s.deployment.name + - k8s.statefulset.name + - k8s.daemonset.name + - k8s.node.name + - k8s.container.name + labels: + - tag_name: app + key: app.kubernetes.io/name + from: pod + pod_association: + - sources: + - from: resource_attribute + name: k8s.pod.uid + - sources: + - from: resource_attribute + name: k8s.namespace.name + - from: resource_attribute + name: k8s.pod.name + batch: + send_batch_size: 8192 + timeout: 5s + send_batch_max_size: 16384 + +exporters: + otlphttp: + endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT + compression: gzip + timeout: 10s + +service: + extensions: [file_storage/checkpoints] + pipelines: + logs/container: + receivers: [filelog/container] + processors: [k8sattributes, batch] + exporters: [otlphttp] +``` + +Validate with the in-tree binary: + +```sh +./_build/tracecore validate 
--config=docs/integrations/examples/filelog-container.yaml +``` + +Exit 0 means the config parses, every component name resolves, +`create_directory: true` will materialize the checkpoint directory at +boot, and the OTLP endpoint URL is well-formed once the placeholder +has been rendered. + +## Deployment shape + +Run tracecore as a `DaemonSet` so every node-local log file is read +by the pod scheduled on that node. The DaemonSet pod template must +mount three host paths read-write or read-only as listed: + +- `/var/log/pods` (read-only) — kubelet writes per-container log + files here. Required by `filelog/container::include`. +- `/var/log/containers` (read-only) — symlink farm kubelet maintains + for the container parser's `add_metadata_from_filepath` to resolve + pod / namespace / container names without an API call. +- `/var/lib/tracecore/filelog` (read-write) — `file_storage` + checkpoints. `create_directory: true` removes the need for an + initContainer to pre-create the path. + +The DaemonSet ServiceAccount needs +`pods get,list,watch` + `nodes get` for `k8sattributesprocessor`. + +## Placeholders + +| Placeholder | What to fill in | +|---|---| +| `REPLACE_WITH_OTLP_HTTP_ENDPOINT` | The OTLP/HTTP base URL of your sink. `/v1/logs` is appended automatically per the OTLP/HTTP spec — do not include it. | + +Tracecore does not expand environment variables in YAML. Render the +literal endpoint at deploy time via `envsubst`, a Helm template, or a +Kubernetes secret-injection driver. The placeholder is loud (the +exporter rejects it on first dispatch) so a misconfigured rollout +fails immediately instead of silently dropping logs. + +## Failure modes + +| Symptom | First check | +|---|---| +| `directory must exist` at validate | Remove a stray operator override of `file_storage::directory` — `create_directory: true` only fires for the path declared in the same extension stanza. 
| +| Logs flow but `k8s.pod.name` is empty | The DaemonSet ServiceAccount is missing `pods get,list,watch`. Check `kubectl auth can-i list pods --as system:serviceaccount::`. | +| Duplicate log lines after a restart | `start_at: end` ships only NEW lines; if you see duplicates, the checkpoint directory is most likely on an emptyDir or a hostPath that got recreated. Move it to a node-local persistent path under `/var/lib/`. | +| `failed to open /var/log/pods/...` | The DaemonSet pod is missing the `/var/log/pods` hostPath mount, or the path is mounted read-write (kubelet's `--root-dir` overrides shift this on some distros). Mount read-only at the kubelet's actual root. | +| High-cardinality label explosion | `k8sattributesprocessor` surfaces the `app.kubernetes.io/name` label plus whatever you add under `extract::labels`. Audit the list against the receiving backend's cardinality budget before adding more. | + +Upstream component docs: +[`receiver/filelogreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver), +[`pkg/stanza/operator/parser/container`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/pkg/stanza/operator/parser/container), +[`processor/k8sattributesprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/k8sattributesprocessor), +[`extension/storage/filestorage`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/extension/storage/filestorage). +Self-telemetry counters appear under the standard +`otelcol_receiver_*` and `otelcol_processor_*` metric families from +`service/telemetry`. 
diff --git a/docs/integrations/journald-kernel.md b/docs/integrations/journald-kernel.md new file mode 100644 index 00000000..ed536736 --- /dev/null +++ b/docs/integrations/journald-kernel.md @@ -0,0 +1,164 @@ + + + +# Kernel + systemd events via `journaldreceiver` + `filelogreceiver` + OTTL + +Tracecore captures node-level kernel and systemd events by tailing +journald and `/dev/kmsg` and normalizing the records through an OTTL +`transform` processor. The transform preserves the customer-stable +`kernelevents.xid` and `gpu.id` attributes from +[RFC-0013 §3](../rfcs/0013-distro-first-pivot.md#3-customer-stable-telemetry-contracts) +so existing operator alerts survive the swap. Replaces the in-tree +`kernelevents` receiver scheduled for deletion at v0.2.0 per +[RFC-0013 §migration PR-K](../rfcs/0013-distro-first-pivot.md#migration--rollout) +and §7 (Deletion list). + +## Config + +```yaml +# docs/integrations/examples/journald-kernel.yaml +receivers: + journald: + directory: /var/log/journal + units: + - kubelet.service + - containerd.service + - systemd-networkd.service + priority: info + filelog/kmsg: + include: + - /dev/kmsg + start_at: end + include_file_path: false + operators: + - id: kmsg-parser + type: regex_parser + regex: '^(?P\d+),(?P\d+),(?P\d+),(?P\w+);(?P.*)$' + severity: + parse_from: attributes.priority + mapping: + error: "3" + warn: "4" + info: "6" + debug: "7" + +processors: + transform/kmsg_xid: + log_statements: + - context: log + statements: + - 'set(attributes["kernelevents.xid"], Int(ExtractPatterns(body, "NVRM: Xid \\(PCI:[0-9a-fA-F:.]+\\): (?P<xid>\\d+)")["xid"])) where IsMatch(body, "NVRM: Xid")' + - 'set(attributes["gpu.id"], ExtractPatterns(body, "NVRM: Xid \\((?P<bdf>PCI:[0-9a-fA-F:.]+)\\)")["bdf"]) where IsMatch(body, "NVRM: Xid")' + transform/journald_service_name: + log_statements: + - context: log + statements: + - 'set(attributes["service.name"], body["_SYSTEMD_UNIT"]) where body["_SYSTEMD_UNIT"] != nil' + batch: + send_batch_size: 4096 + 
timeout: 5s + +exporters: + otlphttp: + endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT + compression: gzip + timeout: 10s + +service: + pipelines: + logs/kmsg: + receivers: [filelog/kmsg] + processors: [transform/kmsg_xid, batch] + exporters: [otlphttp] + logs/journald: + receivers: [journald] + processors: [transform/journald_service_name, batch] + exporters: [otlphttp] +``` + +Validate with the in-tree binary: + +```sh +./_build/tracecore validate --config=docs/integrations/examples/journald-kernel.yaml +``` + +Exit 0 means the config parses, the regex compiles, every OTTL +statement type-checks, and the OTLP endpoint URL is well-formed once +the placeholder is rendered. `journaldreceiver` ships as Alpha +upstream — pin a contrib release in deploy-time CI before promoting +this recipe to a production rollout. + +## Deployment shape + +Run tracecore as a `DaemonSet` with `hostPID: false`, hostPath +mounts: + +- `/var/log/journal` (read-only) — journald persistent logs. + Required by `journald::directory`. If the host uses runtime + journald (`/run/log/journal/`) instead of persistent storage, point + the receiver at that path; the receiver does not merge multiple + directories. +- `/dev/kmsg` (read-only) — kernel ring buffer. Required by + `filelog/kmsg::include`. The DaemonSet pod's `securityContext` may + need `SYS_ADMIN` or `readOnlyRootFilesystem: false` on hardened + distros where `/dev/kmsg` is not group-readable by `nonroot`; the + upstream tracecore chart leaves the receiver disabled by default so + the security trade-off is an opt-in. + +## OTTL statement quoting + +The OTTL `statements:` strings are **single-quoted** in this recipe. +Embedded `:` characters (inside regex character classes like +`[0-9a-fA-F:.]`) would otherwise cause YAML to attempt mapping-key +parsing of the statement and produce +`type=string cannot be used as a Conf` at validate time. 
Future +edits MUST keep the single-quote wrapper; the validator-recipe CI +gate fires immediately if the quoting drifts. + +## Pipeline split: kmsg vs journald + +The recipe ships **two pipelines** instead of one fanout because the +two receivers emit records with different body shapes: + +- `filelog/kmsg` emits `body=string` (the raw kmsg line; the + `regex_parser` captures land in `attributes`, not in body). +- `journald` emits `body=map` (every journald field keyed verbatim; + e.g. `body["_SYSTEMD_UNIT"]` is the unit name). + +Running a single shared `transform` over both feeds would force +`IsMatch(body, ...)` to run against a map for journald records and +runtime-error. Keeping the transforms next to the receivers that +shaped the body keeps each statement statically correct. + +## Customer-stable attribute mapping + +| Surface | Source | Notes | +|---|---|---| +| `kernelevents.xid` (log attr, int) | OTTL regex on `body` for `NVRM: Xid (PCI:...): NN` in the `logs/kmsg` pipeline. | Preserves the [RFC-0013 §3](../rfcs/0013-distro-first-pivot.md#3-customer-stable-telemetry-contracts) contract. The transform is idempotent — re-running on an already-tagged record overwrites with the same value. | +| `gpu.id` (log attr, string, PCI BDF) | OTTL regex on `body` extracting the `PCI:...` segment of the Xid line in the `logs/kmsg` pipeline. | Joins to the `gpu.id` attribute on `prometheusreceiver` GPU scrapes (per the [`prometheus-scrape`](prometheus-scrape.md) recipe), enabling cross-signal pattern correlation. | +| `service.name` (log attr) | Copied from `body["_SYSTEMD_UNIT"]` on journald records in the `logs/journald` pipeline. | Lets backend filters use the standard OTel resource convention instead of the journald-specific `_SYSTEMD_*` keys. | + +## Placeholders + +| Placeholder | What to fill in | +|---|---| +| `REPLACE_WITH_OTLP_HTTP_ENDPOINT` | The OTLP/HTTP base URL of your sink. `/v1/logs` is appended automatically per the OTLP/HTTP spec. 
| + +Tracecore does not expand environment variables in YAML. Render the +literal endpoint at deploy time via `envsubst`, a Helm template, or a +Kubernetes secret-injection driver. + +## Failure modes + +| Symptom | First check | +|---|---| +| `cannot resolve the configuration: retrieved value (type=string) cannot be used as a Conf` | An OTTL statement lost its single-quote wrapper. Re-add the surrounding `'...'`. | +| `journald: directory does not exist` | The host stores journal logs at `/run/log/journal/` (runtime-only, journald-volatile mode). Change `journald::directory` to that path or enable persistent journald (`Storage=persistent` in `/etc/systemd/journald.conf`). | +| `filelog/kmsg: open /dev/kmsg: permission denied` | The DaemonSet ServiceAccount or pod `securityContext` cannot read `/dev/kmsg`. On a `restricted` Pod Security Standard, add a `runAsUser: 0` override on this pod ONLY (kmsg is root-readable on most distros) and document the deviation in the chart values. | +| `kernelevents.xid` empty on a known-Xid line | The kernel formatted the Xid differently (some drivers emit `Xid (PCI:0000:...): 79` vs `Xid 79`). Adjust the OTTL regex; the test fixture for the upstream kmsg parser lives in `tools/failure-inject/testdata/`. | +| Severity always `INFO` | The kmsg priority field did not match. Recheck the `regex_parser::regex` — kernels older than 3.5 emit a different format. | + +Upstream component docs: +[`receiver/journaldreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/journaldreceiver), +[`receiver/filelogreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/filelogreceiver), +[`processor/transformprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor). 
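Reviewer note on the journald-kernel recipe above: the two Xid patterns can be sanity-checked offline before a kind-cluster run. The sketch below is an illustration only, not part of the PR. It assumes the capture groups are named `xid` and `bdf` (matching the `["xid"]` and `["bdf"]` lookups in the OTTL statements), merges the recipe's two patterns into one for brevity, and uses a hypothetical sample line; real driver output may differ, as the failure-modes table already warns.

```python
import re

# Assumption: a representative NVRM Xid line. This sample is hypothetical;
# actual kmsg output varies by driver version.
SAMPLE = "NVRM: Xid (PCI:0000:3b:00): 79, pid=1234, GPU has fallen off the bus."

# Same character classes as the recipe's OTTL patterns, combined into a
# single expression. Named groups mirror the keys the OTTL statements index.
XID_RE = re.compile(r"NVRM: Xid \((?P<bdf>PCI:[0-9a-fA-F:.]+)\): (?P<xid>\d+)")


def extract_xid(body: str):
    """Return (xid, bdf) for a matching line, or None when the line does not match."""
    match = XID_RE.search(body)
    if match is None:
        return None
    # The recipe stores kernelevents.xid as an int and gpu.id as a string.
    return int(match.group("xid")), match.group("bdf")


if __name__ == "__main__":
    print(extract_xid(SAMPLE))
    print(extract_xid("NVRM: Xid 79"))  # short form without the PCI segment
```

A line in the short `Xid 79` form returns `None` here, which corresponds to the "`kernelevents.xid` empty on a known-Xid line" symptom: the pattern needs adjusting rather than the pipeline.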
diff --git a/docs/integrations/k8sobjects-events.md b/docs/integrations/k8sobjects-events.md new file mode 100644 index 00000000..38e646db --- /dev/null +++ b/docs/integrations/k8sobjects-events.md @@ -0,0 +1,167 @@ + + + +# Kubernetes events via `k8sobjectsreceiver` + OTTL `k8s.event.hint` + +Tracecore watches the Kubernetes Events API via the upstream +`k8sobjectsreceiver` and derives the customer-stable +`k8s.event.hint` attribute (the 11-entry enum from +[RFC-0013 §3](../rfcs/0013-distro-first-pivot.md#3-customer-stable-telemetry-contracts): +`pod_evicted`, `mount_failure`, `backoff`, `oom_killed`, +`node_unhealthy`, `schedule_failure`, `create_failure`, +`volume_attach_failure`, `container_status_unknown`, +`node_pressure`, `image_pull_failure`) via an OTTL `transform` +processor. Replaces the in-tree `k8sevents` receiver scheduled for +deletion at v0.2.0 per +[RFC-0013 §migration PR-K](../rfcs/0013-distro-first-pivot.md#migration--rollout) +and §7 (Deletion list). + +> **Validation note.** The upstream `k8sobjectsreceiver` calls +> `Config.Validate()` which enumerates server-preferred API +> resources via the discovery client. That call needs a reachable +> API server, so `tracecore validate --config=...` fails offline +> with `KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be +> defined`. The recipe carries the +> `` marker and the +> `validator-recipe` CI gate skips it with a named log line. The +> example is exercised at chart-install time by the kind cluster CI +> job (`.github/workflows/chart.yml`); see the Verification section +> for the operator-side check. 
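Whether `tracecore validate` can even attempt this recipe depends on the two in-cluster environment variables named in the error message. A minimal pre-flight sketch, assuming a POSIX shell: `has_in_cluster_env` is a hypothetical helper, and having both variables set is necessary but not sufficient, because the API server must also be reachable.

```sh
#!/bin/sh
# Hypothetical pre-flight check (not part of the repo's scripts): succeed
# only when both in-cluster variables named in the validate error are set
# to non-empty values.
has_in_cluster_env() {
  [ -n "${KUBERNETES_SERVICE_HOST:-}" ] && [ -n "${KUBERNETES_SERVICE_PORT:-}" ]
}

if has_in_cluster_env; then
  echo "in-cluster env present: validate can attempt API discovery"
else
  echo "in-cluster env absent: expect the KUBERNETES_SERVICE_HOST error; use the cluster-side check"
fi
```

This only reports which situation applies; it deliberately does not invoke the binary.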
+ +## Config + +```yaml +# docs/integrations/examples/k8sobjects-events.yaml +receivers: + k8sobjects: + auth_type: serviceAccount + objects: + - name: events + mode: watch + group: "" + +processors: + transform/hint: + log_statements: + - context: log + statements: + - set(attributes["k8s.event.hint"], "pod_evicted") where body["object"]["reason"] == "Evicted" + - set(attributes["k8s.event.hint"], "oom_killed") where body["object"]["reason"] == "OOMKilling" + - set(attributes["k8s.event.hint"], "backoff") where body["object"]["reason"] == "BackOff" + - set(attributes["k8s.event.hint"], "create_failure") where body["object"]["reason"] == "FailedCreatePodSandBox" or body["object"]["reason"] == "FailedCreate" + - set(attributes["k8s.event.hint"], "schedule_failure") where body["object"]["reason"] == "FailedScheduling" + - set(attributes["k8s.event.hint"], "mount_failure") where body["object"]["reason"] == "FailedMount" + - set(attributes["k8s.event.hint"], "volume_attach_failure") where body["object"]["reason"] == "FailedAttachVolume" + - set(attributes["k8s.event.hint"], "image_pull_failure") where body["object"]["reason"] == "Failed" or body["object"]["reason"] == "ErrImagePull" or body["object"]["reason"] == "ImagePullBackOff" + - set(attributes["k8s.event.hint"], "container_status_unknown") where body["object"]["reason"] == "ContainerStatusUnknown" + - set(attributes["k8s.event.hint"], "node_pressure") where body["object"]["reason"] == "EvictionThresholdMet" or body["object"]["reason"] == "NodeHasInsufficientMemory" or body["object"]["reason"] == "NodeHasDiskPressure" or body["object"]["reason"] == "NodeHasInsufficientPID" + - set(attributes["k8s.event.hint"], "node_unhealthy") where body["object"]["reason"] == "NodeNotReady" or body["object"]["reason"] == "NodeNotSchedulable" + batch: + send_batch_size: 1024 + timeout: 10s + +exporters: + otlphttp: + endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT + compression: gzip + timeout: 10s + +service: + pipelines: + 
logs/k8sevents: + receivers: [k8sobjects] + processors: [transform/hint, batch] + exporters: [otlphttp] +``` + +## Deployment shape + +Run tracecore as a **single-replica `Deployment`** (NOT a DaemonSet) +so exactly one watcher streams the Events API. The +`k8sobjectsreceiver` does no client-side deduplication — N replicas +will produce N copies of every event. + +The ServiceAccount needs `events get,list,watch` cluster-wide: + +```yaml +apiVersion: rbac.authorization.k8s.io/v1 +kind: ClusterRole +metadata: { name: tracecore-events } +rules: + - apiGroups: [""] + resources: ["events"] + verbs: ["get", "list", "watch"] +``` + +## `k8s.event.hint` enum mapping + +The eleven values are derived from `Event.reason` strings emitted by +the kubelet and core controllers +([source](https://github.com/kubernetes/kubernetes/blob/master/pkg/kubelet/events/event.go)). +The mapping is intentionally exhaustive over the failure modes +operators actually alert on — non-failure reasons (`Started`, +`Pulled`, `Created`, `Scheduled`) intentionally do NOT receive a +hint so dashboards can filter on `attributes["k8s.event.hint"] != +nil` to surface only actionable events. + +| Hint | `Event.reason` triggers | +|---|---| +| `pod_evicted` | `Evicted` | +| `oom_killed` | `OOMKilling` | +| `backoff` | `BackOff` | +| `create_failure` | `FailedCreatePodSandBox`, `FailedCreate` | +| `schedule_failure` | `FailedScheduling` | +| `mount_failure` | `FailedMount` | +| `volume_attach_failure` | `FailedAttachVolume` | +| `image_pull_failure` | `Failed`, `ErrImagePull`, `ImagePullBackOff` | +| `container_status_unknown` | `ContainerStatusUnknown` | +| `node_pressure` | `EvictionThresholdMet`, `NodeHasInsufficientMemory`, `NodeHasDiskPressure`, `NodeHasInsufficientPID` | +| `node_unhealthy` | `NodeNotReady`, `NodeNotSchedulable` | + +When the upstream kubelet adds a new failure reason, extend +`transform/hint` with one new `set(...) 
where ...` line; the +eleven-entry enum is closed so anything not on the list above MUST +land with `k8s.event.hint = nil` to avoid silently miscategorizing it. + +## Placeholders + +| Placeholder | What to fill in | +|---|---| +| `REPLACE_WITH_OTLP_HTTP_ENDPOINT` | The OTLP/HTTP base URL of your sink. `/v1/logs` is appended automatically per the OTLP/HTTP spec. | + +Tracecore does not expand environment variables in YAML. Render the +literal endpoint at deploy time via `envsubst`, a Helm template, or a +Kubernetes secret-injection driver. + +## Verification + +Because `tracecore validate` cannot cover this recipe offline, the +operator-side check is: + +1. **Apply the rendered config inside the cluster.** Boot tracecore + via `helm install` or `kubectl apply`. The pod's `READY` should + flip within ~3s; the `healthcheckextension` endpoint returns 200 + once `k8sobjectsreceiver` has completed its initial Events list. +2. **Trigger a known-hint event.** `kubectl run pause --image= + pause:3.10 --restart=Never --overrides='{"spec":{"containers": + [{"name":"pause","image":"pause:3.10","resources":{"limits": + {"memory":"1Ki"}}}]}}'` provokes an `OOMKilling` event within + seconds. +3. **Confirm `k8s.event.hint=oom_killed`** lands at the backend. If + the attribute is missing, no statement matched the event's + `reason`; if it carries a different hint, check the OTTL statement + order (later statements overwrite earlier ones; the recipe is + ordered so the most specific match wins). + +## Failure modes + +| Symptom | First check | +|---|---| +| `KUBERNETES_SERVICE_HOST and KUBERNETES_SERVICE_PORT must be defined` at validate | Expected when running `tracecore validate` outside a pod; this recipe ships with `tested-against: requires-k8s-cluster` so the CI gate skips it. Use the Verification section to gate operator rollout. | +| `events.events.k8s.io is forbidden` | The ServiceAccount is missing the ClusterRole binding. `kubectl auth can-i list events --as system:serviceaccount::` should return `yes`.
| +| Duplicate events | The Deployment is running >1 replica. `k8sobjectsreceiver` is single-watcher by design; scale `replicas: 1`. | +| Events flow but `k8s.event.hint` is always `nil` | The `Event.reason` is not on the eleven-entry list. Decide: (a) extend `transform/hint`, or (b) leave the enum closed — see the section above. | +| Very old events on boot | `k8sobjectsreceiver` in `watch` mode resumes from the cluster's last `resourceVersion`, which can be hours stale. Set `resource_version: "0"` to read from now; expect a one-time list operation against the kube-apiserver. | + +Upstream component docs: +[`receiver/k8sobjectsreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/k8sobjectsreceiver), +[`processor/transformprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor). diff --git a/docs/integrations/prometheus-scrape.md b/docs/integrations/prometheus-scrape.md new file mode 100644 index 00000000..e795acc2 --- /dev/null +++ b/docs/integrations/prometheus-scrape.md @@ -0,0 +1,153 @@ + + + +# Prometheus scrape via `prometheusreceiver` + +Tracecore scrapes Prometheus-format endpoints via the upstream +`prometheusreceiver`. This is the adoption shape for every vendor +GPU exporter per +[RFC-0013 §2 (Adoption matrix)](../rfcs/0013-distro-first-pivot.md#2-adoption-matrix): +NVIDIA `dcgm-exporter`, AMD `ROCm/device-metrics-exporter`, Intel +`intel/xpumanager`, Habana Prometheus Metric Exporter — and for +the Kueue scheduler's metrics endpoint. Replaces the in-tree `dcgm` +and `kueue` receivers per RFC-0013 §7 (Deletion list — v0.1.0). +An OTTL `transform` processor stamps the customer-stable +`gpu.vendor` resource attribute (RFC-0013 §3) so dashboards survive +a future swap between vendor exporters. 
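The `gpu.vendor` stamp keys off each exporter's metric-name prefix (the `IsMatch` patterns in this recipe's `transform/gpu_vendor` processor). To predict which vendor a scraped metric name would receive, a small shell sketch can mirror those prefixes. `gpu_vendor_for` is a hypothetical helper for local debugging and is not part of tracecore; the sample metric names are illustrative.

```sh
#!/bin/sh
# Hypothetical debugging helper: map a metric name to the gpu.vendor value
# the recipe's prefix patterns would stamp. Prints an empty line when no
# prefix matches, which corresponds to the attribute being left unset.
gpu_vendor_for() {
  case "$1" in
    DCGM_*)       echo "nvidia" ;;
    amdsmi_*)     echo "amd" ;;
    xpum_*)       echo "intel" ;;
    habanalabs_*) echo "habana" ;;
    *)            echo "" ;;
  esac
}

gpu_vendor_for "DCGM_FI_DEV_GPU_TEMP"   # prints: nvidia
gpu_vendor_for "dcgm_gpu_temp"          # prints an empty line (lowercase legacy prefix)
```

The match is case-sensitive, like the `^DCGM_` pattern it mirrors, so the lowercase legacy prefix falls through to the empty case.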
+ +## Config + +```yaml +# docs/integrations/examples/prometheus-scrape.yaml +receivers: + prometheus: + config: + scrape_configs: + - job_name: dcgm-exporter + scrape_interval: 15s + scrape_timeout: 10s + metrics_path: /metrics + static_configs: + - targets: + - REPLACE_WITH_DCGM_EXPORTER_TARGET + +processors: + transform/gpu_vendor: + metric_statements: + - context: datapoint + statements: + - set(resource.attributes["gpu.vendor"], "nvidia") where IsMatch(metric.name, "^DCGM_") + - set(resource.attributes["gpu.vendor"], "amd") where IsMatch(metric.name, "^amdsmi_") + - set(resource.attributes["gpu.vendor"], "intel") where IsMatch(metric.name, "^xpum_") + - set(resource.attributes["gpu.vendor"], "habana") where IsMatch(metric.name, "^habanalabs_") + batch: + send_batch_size: 8192 + timeout: 10s + +exporters: + otlphttp: + endpoint: REPLACE_WITH_OTLP_HTTP_ENDPOINT + compression: gzip + timeout: 10s + +service: + pipelines: + metrics/scrape: + receivers: [prometheus] + processors: [transform/gpu_vendor, batch] + exporters: [otlphttp] +``` + +Validate with the in-tree binary: + +```sh +./_build/tracecore validate --config=docs/integrations/examples/prometheus-scrape.yaml +``` + +Exit 0 means the config parses, every scrape target URL is +well-formed, and the OTTL statements type-check against the +metric-datapoint context. + +## Deployment shape + +The right Kubernetes shape depends on the scrape target: + +- **Per-node targets** (NVIDIA `dcgm-exporter`, + AMD/Intel/Habana per-node exporters): run tracecore as a + `DaemonSet` and scrape `localhost:` so each node's exporter + is read by the tracecore pod on the same node. No cluster-wide + service discovery required. +- **Cluster-scoped targets** (Kueue's controller-manager metrics + endpoint, single-replica vendor exporters): run tracecore as a + single-replica `Deployment` and scrape the target's Service. 
Pair + with `kubernetes_sd_configs:` if the target moves between pods on + re-roll; for a stable Service ClusterIP, `static_configs:` is + enough. + +## Adding authenticated targets (Kueue, vendor exporters) + +The example scrapes a static unauthenticated target. For Kueue's +controller-manager metrics endpoint (TLS + serviceaccount-token bearer): + +```yaml + - job_name: kueue + scheme: https + scrape_interval: 30s + metrics_path: /metrics + authorization: + type: Bearer + credentials_file: /var/run/secrets/kubernetes.io/serviceaccount/token + tls_config: + ca_file: /var/run/secrets/kubernetes.io/serviceaccount/ca.crt + server_name: kueue-controller-manager-metrics-service.kueue-system.svc + static_configs: + - targets: + - kueue-controller-manager-metrics-service.kueue-system.svc:8443 +``` + +Adjust `server_name` to match the Service's DNS name. The +`credentials_file` path is the default ServiceAccount projected-token +mount; if you use a custom token volume, update the path. + +## `gpu.vendor` resource-attribute mapping + +The OTTL transform assigns a vendor tag based on the metric-name +prefix each upstream exporter uses: + +| `metric.name` prefix | `gpu.vendor` | Upstream exporter | +|---|---|---| +| `DCGM_*` | `nvidia` | [NVIDIA/dcgm-exporter](https://github.com/NVIDIA/dcgm-exporter) | +| `amdsmi_*` | `amd` | [ROCm/device-metrics-exporter](https://github.com/ROCm/device-metrics-exporter) | +| `xpum_*` | `intel` | [intel/xpumanager](https://github.com/intel/xpumanager) | +| `habanalabs_*` | `habana` | Habana Prometheus Metric Exporter | + +The tag preserves the [RFC-0013 §3](../rfcs/0013-distro-first-pivot.md#3-customer-stable-telemetry-contracts) +contract, so existing dashboards keyed on `gpu.vendor` continue to +work after a vendor swap. + +## Placeholders + +| Placeholder | What to fill in | +|---|---| +| `REPLACE_WITH_OTLP_HTTP_ENDPOINT` | The OTLP/HTTP base URL of your sink. `/v1/metrics` is appended automatically per the OTLP/HTTP spec.
| +| `REPLACE_WITH_DCGM_EXPORTER_TARGET` | `localhost:9400` for a DaemonSet shape, or the dcgm-exporter Service DNS (`dcgm-exporter.kube-system.svc:9400`) for a Deployment shape. | + +Tracecore does not expand environment variables in YAML. Render the +literals at deploy time via `envsubst`, a Helm template, or a +Kubernetes secret-injection driver. The `:port` suffix is mandatory +— `prometheusreceiver` rejects bare hostnames at validate. + +## Failure modes + +| Symptom | First check | +|---|---| +| `scrape_configs.targets[0]: address ... incorrect` at validate | The target placeholder still carries `REPLACE_WITH_DCGM_EXPORTER_TARGET` — the validator now rejects literal placeholders that look like hostnames. Render at deploy time. | +| Scrape returns 200 but no metrics flow | `prometheusreceiver` requires the response to be in Prometheus text exposition format. A target that returns OTLP-JSON or vendor-proprietary format silently drops. Curl the endpoint and confirm the first line starts with `# HELP`. | +| `gpu.vendor` empty on a known DCGM target | The exporter is on an old release that emits the legacy `dcgm_*` prefix (lowercase). Either upgrade the exporter to a `DCGM_*`-emitting build or extend the OTTL regex to `^[Dd][Cc][Gg][Mm]_`. | +| `cardinality limit exceeded` from the backend | `prometheusreceiver` does not cap series. Add a `filterprocessor` between `prometheus` and `transform/gpu_vendor` to drop metrics you don't query. Cap dcgm-exporter's `--collectors` flag to the families you alert on. | +| Bearer-token target returns 401 | The ServiceAccount lacks the binding to the target's RBAC. For Kueue, the SA needs `nonResourceURLs: ["/metrics"] verbs: ["get"]` via a ClusterRoleBinding. 
| + +Upstream component docs: +[`receiver/prometheusreceiver`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/receiver/prometheusreceiver), +[`processor/transformprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/transformprocessor), +[`processor/filterprocessor`](https://github.com/open-telemetry/opentelemetry-collector-contrib/tree/main/processor/filterprocessor). diff --git a/docs/migration/v0.1-to-v0.2.md b/docs/migration/v0.1-to-v0.2.md index 7ba0e171..433e6dab 100644 --- a/docs/migration/v0.1-to-v0.2.md +++ b/docs/migration/v0.1-to-v0.2.md @@ -196,3 +196,11 @@ helm upgrade tracecore install/kubernetes/tracecore \ ``` Charts are pinned by `--version` (the chart-package version from `Chart.yaml`), not by `appVersion`; the image tag pins the binary independently. No data is mutated on upgrade. + +## Open items (fill in as PRs land) + +- [ ] PR-I (in-repo Go submodule extraction at `module/`) — link +- [x] PR-J (ship recipes for filelog + journald + k8sobjects + prometheus) — [`docs/integrations/{filelog-container,journald-kernel,k8sobjects-events,prometheus-scrape}.md`](../integrations/) +- [ ] PR-K (delete in-tree receivers) — link +- [ ] PR-L (this guide, full body) — link +- [ ] PR-E unblocking decision (heartbeat replacement) — link diff --git a/scripts/doc-check.sh b/scripts/doc-check.sh index f05e89d9..94096596 100755 --- a/scripts/doc-check.sh +++ b/scripts/doc-check.sh @@ -342,7 +342,14 @@ if [ -d "$integration_dir" ]; then # marker for recipes that route through `./tracecore validate` but # depend on exporters not bundled until RFC-0013 §Migration PR-A # (OCB skeleton) lands. Flip back to `tracecore` when PR-A merges. - if ! grep -qE '^' "$recipe"; then + # `requires-k8s-cluster` covers recipes whose receiver's Validate() + # makes a live API call (e.g. 
k8sobjectsreceiver enumerates server- + # preferred resources at validate time); `tracecore validate` cannot + # cover them offline, so the recipe MUST instead document a CI/ + # cluster-side verification path. The validator-recipe gate skips + # the marker with a named log line; the recipe's example YAML still + # ships and is consumed by chart / kind tests. + if ! grep -qE '^' "$recipe"; then echo "doc-check: $recipe missing \`\` or \`\` marker" exit 1 fi diff --git a/scripts/validator-recipe.sh b/scripts/validator-recipe.sh index 5ea42f4f..299af693 100755 --- a/scripts/validator-recipe.sh +++ b/scripts/validator-recipe.sh @@ -172,6 +172,15 @@ while IFS= read -r recipe; do echo "validator-recipe: $recipe -> SKIP (pending RFC-0013 §Migration PR-A: exporter not yet bundled)" skipped_count=$((skipped_count + 1)) ;; + requires-k8s-cluster) + # The receiver's upstream Validate() reaches a live Kubernetes + # API server (e.g. k8sobjectsreceiver enumerates server-preferred + # resources before listing the requested objects). `tracecore + # validate` cannot cover this offline. The recipe documents the + # cluster-side verification path it ships with. + echo "validator-recipe: $recipe -> SKIP (requires-k8s-cluster: receiver Validate() needs a live API server; see the recipe's Verification section)" + skipped_count=$((skipped_count + 1)) + ;; *) echo "validator-recipe: $recipe carries unrecognized tested-against marker: $marker" >&2 fail=1