Node Doctor Monitors Documentation

This document provides comprehensive information about all monitor types available in Node Doctor, their configuration options, and how to create custom monitors.

Overview
Architecture
System Monitors
Network Monitors
Kubernetes Monitors
Custom Monitors
- Plugin Monitor
- Log Pattern Monitor
Troubleshooting
Creating Custom Monitors

Overview

Node Doctor provides a comprehensive set of health monitors for Kubernetes nodes. Each monitor is designed to detect specific types of issues and report them through a unified status reporting system.

Key Features:

Pluggable Architecture: Register monitors via the factory pattern
Thread-Safe: All monitors use mutex protection for concurrent operations
Context-Based Timeouts: Proper cancellation and timeout enforcement
Failure Threshold Tracking: Prevent false positives from transient failures
Recovery Detection: Automatically report when issues are resolved
Configurable: Fine-tune behavior through YAML configuration

Architecture

BaseMonitor Pattern

All monitors extend BaseMonitor which provides:

Lifecycle management (Start/Stop)
Periodic check scheduling
Status channel management
Graceful shutdown handling

type BaseMonitor struct {
    name       string
    interval   time.Duration
    timeout    time.Duration
    statusChan chan *types.Status
    checkFunc  CheckFunc
    logger     Logger
}

Monitor Registry

Monitors self-register at initialization using the registry pattern:

func init() {
    monitors.Register(monitors.MonitorInfo{
        Type:        "system-cpu-check",
        Factory:     NewCPUMonitor,
        Validator:   ValidateCPUConfig,
        Description: "Monitors CPU load average and thermal throttling",
    })
}

Status Reporting

Monitors report health through types.Status:

Events: Point-in-time occurrences (Info, Warning, Error)
Conditions: Persistent states (True/False with Reason and Message)

System Monitors

CPU Monitor

Monitors CPU load average and thermal throttling on the system.

Monitor Type: system-cpu-check

Source File: pkg/monitors/system/cpu.go:584

Configuration:

monitors:
  - name: cpu-health
    type: system-cpu-check
    interval: 30s
    timeout: 5s
    config:
      warningLoadFactor: 0.8        # 80% of CPU cores (auto-validated: 0-100)
      criticalLoadFactor: 1.5       # 150% of CPU cores (auto-validated: 0-100)
      sustainedHighLoadChecks: 3    # Consecutive checks before alerting
      checkThermalThrottle: true
      checkLoadAverage: true

Default Values:

warningLoadFactor: 0.8 (80%)
criticalLoadFactor: 1.5 (150%)
sustainedHighLoadChecks: 3
checkThermalThrottle: true
checkLoadAverage: true

Note: Threshold values (warningLoadFactor, criticalLoadFactor) are automatically validated to be in the range 0-100. See Configuration Guide for details.

Key Features:

Reads /proc/loadavg for 1-minute load average
Normalizes load by CPU core count
Sustained high load tracking (prevents alert spam from transient spikes)
Thermal throttling detection from /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count
Recovery event generation

Events Generated:

HighCPULoad (Warning/Error): CPU load exceeds thresholds
CPUThermalThrottle (Warning): Thermal throttling detected
CPULoadRecovered (Info): CPU load returned to normal

Conditions:

HighCPULoad: True when sustained high load detected

Example Status:

events:
  - severity: Warning
    reason: HighCPULoad
    message: "CPU load at 120% (1.5 load factor), exceeds 80% warning threshold"
conditions:
  - type: HighCPULoad
    status: "True"
    reason: SustainedHighLoad
    message: "CPU load has been high for 3 consecutive checks"

Memory Monitor

Monitors memory usage, swap usage, and OOM (Out-Of-Memory) kill events.

Monitor Type: system-memory-check

Source File: pkg/monitors/system/memory.go:653

Configuration:

monitors:
  - name: memory-health
    type: system-memory-check
    interval: 30s
    timeout: 5s
    config:
      warningThreshold: 85          # 85% memory usage (auto-validated: 0-100)
      criticalThreshold: 95         # 95% memory usage (auto-validated: 0-100)
      swapWarningThreshold: 50      # 50% swap usage (auto-validated: 0-100)
      swapCriticalThreshold: 80     # 80% swap usage (auto-validated: 0-100)
      sustainedHighMemoryChecks: 3
      checkOOMKills: true

Default Values:

warningThreshold: 85%
criticalThreshold: 95%
swapWarningThreshold: 50%
swapCriticalThreshold: 80%
sustainedHighMemoryChecks: 3
checkOOMKills: true

Note: Threshold values are automatically validated to be in the range 0-100. See Configuration Guide for details.

Key Features:

Reads /proc/meminfo for memory statistics
Monitors /dev/kmsg for OOM kill events
ARM64 compatibility (handles /dev/kmsg EINVAL gracefully)
Sustained high memory tracking
Swap usage monitoring
OOM kill detection with process details

Events Generated:

HighMemoryUsage (Warning/Error): Memory usage exceeds thresholds
HighSwapUsage (Warning/Error): Swap usage exceeds thresholds
OOMKillDetected (Error): OOM killer has terminated a process
MemoryRecovered (Info): Memory usage returned to normal

Conditions:

HighMemoryUsage: True when sustained high memory detected

Special Notes:

ARM64 /dev/kmsg Issue: On ARM64 systems, /dev/kmsg may return EINVAL errors. The monitor handles this gracefully and continues without OOM kill detection.

Example Status:

events:
  - severity: Error
    reason: OOMKillDetected
    message: "OOM killer terminated process 'node-doctor' (PID 12345)"
  - severity: Warning
    reason: HighMemoryUsage
    message: "Memory usage at 87.5%, exceeds 85% warning threshold"

Disk Monitor

Monitors disk space, inode usage, read-only filesystems, and I/O health across multiple mount points.

Monitor Type: system-disk-check

Source File: pkg/monitors/system/disk.go:870

Configuration:

monitors:
  - name: disk-health
    type: system-disk-check
    interval: 60s
    timeout: 10s
    config:
      mountPoints:
        - path: /
          warningThreshold: 85
          criticalThreshold: 95
          inodeWarningThreshold: 85
          inodeCriticalThreshold: 95
        - path: /var/lib/kubelet
          warningThreshold: 90
          criticalThreshold: 95
      checkReadOnlyFilesystem: true
      checkIOHealth: true
      ioUtilizationWarning: 80      # 80% I/O utilization
      ioUtilizationCritical: 95     # 95% I/O utilization

Default Values per Mount Point:

warningThreshold: 85%
criticalThreshold: 95%
inodeWarningThreshold: 85%
inodeCriticalThreshold: 95%

Note: All threshold values (warningThreshold, criticalThreshold, inodeWarningThreshold, inodeCriticalThreshold, ioUtilizationWarning, ioUtilizationCritical) are automatically validated to be in the range 0-100. See Configuration Guide for details.

Key Features:

Multi-mount point support with individual thresholds
Space and inode monitoring via syscall.Statfs()
Read-only filesystem detection (remount attempts)
I/O health monitoring via /proc/diskstats
I/O utilization tracking (percent time device is busy)
Per-device metrics aggregation

Events Generated:

HighDiskUsage (Warning/Error): Disk space exceeds thresholds
HighInodeUsage (Warning/Error): Inode usage exceeds thresholds
ReadOnlyFilesystem (Error): Filesystem is mounted read-only
HighIOUtilization (Warning/Error): I/O utilization exceeds thresholds
DiskRecovered (Info): Disk usage returned to normal

Conditions:

HighDiskUsage: True when disk space critical
ReadOnlyFilesystem: True when filesystem is read-only

Example Status:

events:
  - severity: Error
    reason: HighDiskUsage
    message: "Disk usage at /: 96.2% exceeds 95% critical threshold"
  - severity: Warning
    reason: HighInodeUsage
    message: "Inode usage at /var/lib/kubelet: 87.3% exceeds 85% warning threshold"

Network Monitors

DNS Monitor

Comprehensive DNS resolution monitoring for cluster-internal, external, and custom domains with advanced intermittent failure detection capabilities.

Monitor Type: network-dns-check

Source File: pkg/monitors/network/dns.go

Basic Configuration

monitors:
  - name: dns-health
    type: network-dns-check
    interval: 30s
    timeout: 10s
    config:
      clusterDomains:
        - kubernetes.default.svc.cluster.local
        - kube-dns.kube-system.svc.cluster.local
      externalDomains:
        - google.com
        - cloudflare.com
      customQueries:
        - domain: api.example.com
          recordType: A
        - domain: ldap.internal.corp
          recordType: A
          testEachNameserver: true    # Test against each DNS server
        - domain: critical-service.internal
          recordType: A
          consistencyCheck: true      # Enable rapid consistency checks
      latencyThreshold: 1s
      nameserverCheckEnabled: true
      failureCountThreshold: 3
      resolverPath: /etc/resolv.conf

Default Values:

clusterDomains: ["kubernetes.default.svc.cluster.local"]
externalDomains: ["google.com", "cloudflare.com"]
latencyThreshold: 1 second
nameserverCheckEnabled: false
failureCountThreshold: 3
resolverPath: /etc/resolv.conf

Enhanced Features

1. Per-Nameserver Domain Testing

Test a specific domain against each configured nameserver individually to identify which DNS server is causing issues.

config:
  customQueries:
    - domain: "critical-service.internal.corp"
      recordType: "A"
      testEachNameserver: true  # Test against each nameserver in /etc/resolv.conf

Use Cases:

Identify which upstream DNS server is failing
Detect intermittent issues caused by DNS server round-robin
Troubleshoot split-horizon DNS problems

Generated Conditions:

CustomDNSDown: All nameservers failed to resolve the domain
DNSResolutionDegraded: Some nameservers working, others failing
CustomDNSHealthy: All nameservers responding successfully

2. Success Rate Tracking (Sliding Window)

Track DNS resolution success rate over a sliding window to detect intermittent failures that consecutive-failure tracking misses.

config:
  successRateTracking:
    enabled: true
    windowSize: 10              # Track last 10 checks
    failureRateThreshold: 30    # Alert if >30% failures (accepts 0-100 or 0.0-1.0)
    minSamplesRequired: 5       # Need at least 5 samples before alerting

Default Values:

enabled: false (disabled by default)
windowSize: 10
failureRateThreshold: 0.3 (30%)
minSamplesRequired: 5

Problem Solved:

Traditional consecutive failure tracking misses intermittent issues:

Check 1: SUCCESS
Check 2: FAIL
Check 3: SUCCESS
Check 4: FAIL
Check 5: FAIL
Check 6: SUCCESS
Check 7: FAIL

With failureCountThreshold: 3, no alert fires because consecutive failures never reach 3. But 57% of checks are failing!

Generated Conditions:

ClusterDNSDegraded: Cluster DNS failure rate exceeds threshold
ClusterDNSIntermittent: Some cluster DNS failures (below threshold)
ExternalDNSDegraded: External DNS failure rate exceeds threshold
ExternalDNSIntermittent: Some external DNS failures (below threshold)

3. DNS Error Type Classification

Automatically classifies DNS errors to help identify root causes faster.

Error Types:

Type	Indicates	Go Error Patterns
`Timeout`	Network issues, server overload, firewall	`i/o timeout`, `context deadline exceeded`
`NXDOMAIN`	Domain doesn't exist, typo, missing record	`no such host`, `DNSError.IsNotFound`
`SERVFAIL`	Upstream DNS error, DNSSEC validation failure	`server misbehaving`
`Refused`	DNS server down, wrong port, firewall	`connection refused`
`Temporary`	Transient failure, retry may succeed	`DNSError.IsTemporary`
`Unknown`	Unclassified error	Other errors

Example Event:

events:
  - severity: Error
    reason: CustomDNSQueryFailed
    message: "[NXDOMAIN] Failed to resolve custom domain ldap.corp.internal: no such host"

4. Consistency Checking (Rapid Multi-Query)

Perform multiple rapid DNS queries to detect intermittent DNS issues that single queries miss.

config:
  consistencyChecking:
    enabled: true
    queriesPerCheck: 5            # Make 5 rapid queries (range: 2-20)
    intervalBetweenQueries: 200ms # Delay between queries (range: 10ms-5s)
  customQueries:
    - domain: "critical-service.internal.corp"
      consistencyCheck: true      # Enable for this specific query

Default Values:

enabled: false (disabled by default)
queriesPerCheck: 5
intervalBetweenQueries: 200ms

Detection Capabilities:

All queries succeed with same IPs → DNSResolutionConsistent
All queries fail → DNSResolutionDown
Some succeed, some fail → DNSResolutionIntermittent
All succeed but different IPs → DNSResolutionInconsistent

Example Status:

events:
  - severity: Warning
    reason: ConsistencyCheckIntermittent
    message: "[Timeout] Intermittent DNS resolution for critical-service.internal.corp: 3/5 queries succeeded (avg latency: 150ms)"
conditions:
  - type: DNSResolutionIntermittent
    status: "True"
    reason: IntermittentConsistencyFailures
    message: "Domain critical-service.internal.corp: 3/5 queries succeeded (60.0% success rate)"

Complete Configuration Reference

monitors:
  - name: dns-health
    type: network-dns-check
    interval: 30s
    timeout: 10s
    config:
      # Basic domain testing
      clusterDomains:
        - kubernetes.default.svc.cluster.local
      externalDomains:
        - google.com
        - cloudflare.com

      # Custom domain queries with per-query options
      customQueries:
        - domain: "api.example.com"
          recordType: "A"                  # Currently only "A" supported
        - domain: "ldap.internal.corp"
          recordType: "A"
          testEachNameserver: true         # Test each nameserver individually
        - domain: "critical-service.internal"
          recordType: "A"
          consistencyCheck: true           # Enable consistency checking

      # Thresholds
      latencyThreshold: 1s                 # Max acceptable query latency
      failureCountThreshold: 3             # Consecutive failures before alert

      # Nameserver configuration
      nameserverCheckEnabled: true         # Check nameserver reachability
      resolverPath: /etc/resolv.conf       # Path to resolver config

      # Success rate tracking (sliding window)
      successRateTracking:
        enabled: true
        windowSize: 10
        failureRateThreshold: 30           # 30% (accepts 0-100 or 0.0-1.0)
        minSamplesRequired: 5

      # Consistency checking (rapid multi-query)
      consistencyChecking:
        enabled: true
        queriesPerCheck: 5                 # 2-20 queries per check
        intervalBetweenQueries: 200ms      # 10ms-5s between queries

Events Generated

Event	Severity	Description
`ClusterDNSResolutionFailed`	Error	Cluster DNS query failed
`ExternalDNSResolutionFailed`	Error	External DNS query failed
`CustomDNSQueryFailed`	Error	Custom domain query failed
`ClusterDNSNoRecords`	Warning	No A records for cluster domain
`ExternalDNSNoRecords`	Warning	No A records for external domain
`CustomDNSNoRecords`	Warning	No A records for custom domain
`HighClusterDNSLatency`	Warning	Cluster DNS latency exceeds threshold
`HighExternalDNSLatency`	Warning	External DNS latency exceeds threshold
`HighCustomDNSLatency`	Warning	Custom DNS latency exceeds threshold
`NameserverUnreachable`	Warning	Nameserver failed to respond
`NameserverDomainResolutionFailed`	Warning	Specific nameserver failed for domain
`NameserverDomainLatencyHigh`	Warning	Nameserver latency exceeds threshold
`ConsistencyCheckIntermittent`	Warning	Some consistency check queries failed
`ConsistencyCheckAllFailed`	Error	All consistency check queries failed
`ConsistencyCheckIPVariation`	Warning	Varying IPs across queries
`ConsistencyCheckHighLatency`	Warning	Average latency exceeds threshold
`ConsistencyCheckCancelled`	Warning	Check cancelled (context timeout)
`UnsupportedQueryType`	Warning	Record type other than A requested
`ResolverConfigParseError`	Warning	Failed to parse /etc/resolv.conf
`NoNameserversConfigured`	Warning	No nameservers found in config

Conditions

Node conditions are prefixed with NodeDoctor when exported to Kubernetes:

Condition (as shown in kubectl)	Description
`NodeDoctorClusterDNSDown`	Cluster DNS failed repeatedly
`NodeDoctorClusterDNSHealthy`	Cluster DNS resolution is healthy
`NodeDoctorNetworkUnreachable`	External DNS failed repeatedly
`NodeDoctorNetworkReachable`	External DNS resolution is healthy
`NodeDoctorClusterDNSDegraded`	Cluster DNS failure rate exceeds threshold
`NodeDoctorClusterDNSIntermittent`	Cluster DNS has intermittent failures
`NodeDoctorExternalDNSDegraded`	External DNS failure rate exceeds threshold
`NodeDoctorExternalDNSIntermittent`	External DNS has intermittent failures
`NodeDoctorCustomDNSDown`	All nameservers failed for custom domain
`NodeDoctorCustomDNSHealthy`	All nameservers healthy for custom domain
`NodeDoctorDNSResolutionDegraded`	Partial nameserver failure for domain
`NodeDoctorDNSResolutionConsistent`	Consistency check: all queries consistent
`NodeDoctorDNSResolutionIntermittent`	Consistency check: some queries failed
`NodeDoctorDNSResolutionInconsistent`	Consistency check: varying IP addresses (expected for CDN/load-balanced domains)
`NodeDoctorDNSResolutionDown`	Consistency check: all queries failed

Querying DNS Conditions:

# View all DNS conditions for a node
kubectl describe node <node-name> | grep -E "NodeDoctorDNS|NodeDoctorCluster|NodeDoctorCustomDNS"

# Example output:
#   NodeDoctorDNSResolutionConsistent   True   ConsistentResolution   Domain google.com: all 5 queries succeeded with consistent IPs
#   NodeDoctorCustomDNSHealthy          True   AllNameserversHealthy  Domain cloudflare.com: all 1 nameservers responding
#   NodeDoctorClusterDNSHealthy         True   ClusterDNSResolved     Cluster DNS resolution is healthy

Example: Enterprise LDAP DNS Monitoring

monitors:
  - name: ldap-dns-critical
    type: network-dns-check
    interval: 30s
    timeout: 10s
    config:
      clusterDomains: []              # Disable cluster DNS checks
      externalDomains: []             # Disable external DNS checks
      customQueries:
        - domain: "ldap.corp.internal"
          recordType: "A"
          testEachNameserver: true    # Identify failing DNS servers
          consistencyCheck: true      # Detect intermittent failures
        - domain: "auth.corp.internal"
          recordType: "A"
          testEachNameserver: true
      latencyThreshold: 500ms         # Stricter latency for auth services
      nameserverCheckEnabled: true
      failureCountThreshold: 2        # Alert faster for critical domains
      successRateTracking:
        enabled: true
        windowSize: 20                # Track more samples
        failureRateThreshold: 10      # 10% failure rate threshold
        minSamplesRequired: 5
      consistencyChecking:
        enabled: true
        queriesPerCheck: 10           # More queries for higher confidence
        intervalBetweenQueries: 100ms # Faster queries

Example Status Output

events:
  - severity: Warning
    reason: NameserverDomainResolutionFailed
    message: "[Timeout] Nameserver 10.154.57.53 failed to resolve ldap.corp.internal: i/o timeout"
  - severity: Warning
    reason: ConsistencyCheckIntermittent
    message: "[Timeout] Intermittent DNS resolution for ldap.corp.internal: 8/10 queries succeeded (avg latency: 120ms)"
conditions:
  - type: DNSResolutionDegraded
    status: "True"
    reason: PartialNameserverFailure
    message: "Domain ldap.corp.internal: 2/3 nameservers responding (failed: 10.154.57.53)"
  - type: DNSResolutionIntermittent
    status: "True"
    reason: IntermittentConsistencyFailures
    message: "Domain ldap.corp.internal: 8/10 queries succeeded (80.0% success rate)"

Best Practices for DNS Monitoring

Identifying Critical DNS Dependencies

Before configuring DNS monitoring, identify domains critical to your infrastructure:

Dependency Type	Examples	Risk Level
Authentication	LDAP/AD servers, OAuth providers	Critical - auth failures block users
Databases	PostgreSQL, MySQL, MongoDB hostnames	Critical - app failures
Service Mesh	Consul, Istio service discovery	High - service routing failures
External APIs	Payment gateways, third-party services	High - feature degradation
Container Registries	gcr.io, docker.io, custom registries	Medium - deployment failures
Cluster Services	kubernetes.default.svc.cluster.local	Critical - pod communication

Common DNS Issues in Kubernetes Clusters

Issue	Symptoms	Detection Method
Upstream DNS overload	Intermittent timeouts in large clusters (100+ nodes)	Success rate tracking, consistency checking
Custom TLD misconfiguration	NXDOMAIN for .local, .internal, .test domains	Error type classification shows NXDOMAIN
Split-horizon DNS	Different results from different nameservers	Per-nameserver testing
CoreDNS pod failures	Cluster DNS fails, external works	Compare clusterDomains vs externalDomains results
DNS cache TTL issues	Stale IPs after service migration	Consistency checking shows IP variation
Network policy blocking	Timeout to specific nameservers	Per-nameserver testing with error classification

Configuration Examples by Use Case

1. Basic External Connectivity Check:

monitors:
  - name: external-dns
    type: network-dns-check
    interval: 60s
    config:
      clusterDomains: []                    # Skip cluster DNS
      externalDomains:
        - google.com
        - cloudflare.com
      latencyThreshold: 2s
      failureCountThreshold: 3

2. Custom TLD Monitoring (.internal, .local):

monitors:
  - name: internal-dns
    type: network-dns-check
    interval: 30s
    config:
      clusterDomains: []
      externalDomains: []
      customQueries:
        - domain: "app.internal.corp"
          recordType: "A"
          testEachNameserver: true          # Find which DNS server fails
        - domain: "db.internal.corp"
          recordType: "A"
          testEachNameserver: true
      nameserverCheckEnabled: true
      failureCountThreshold: 2

3. High-Availability DNS Validation:

monitors:
  - name: ha-dns-check
    type: network-dns-check
    interval: 15s                           # More frequent checks
    config:
      customQueries:
        - domain: "api-gateway.prod.svc.cluster.local"
          consistencyCheck: true            # Detect intermittent failures
          testEachNameserver: true          # Check all DNS servers
      successRateTracking:
        enabled: true
        windowSize: 20
        failureRateThreshold: 5             # 5% threshold for critical services
        minSamplesRequired: 10
      consistencyChecking:
        enabled: true
        queriesPerCheck: 10
        intervalBetweenQueries: 50ms        # Aggressive testing

4. Database Hostname Monitoring:

monitors:
  - name: database-dns
    type: network-dns-check
    interval: 30s
    config:
      customQueries:
        - domain: "postgres-primary.db.svc.cluster.local"
          recordType: "A"
          consistencyCheck: true
        - domain: "postgres-replica.db.svc.cluster.local"
          recordType: "A"
          consistencyCheck: true
        - domain: "redis-master.cache.svc.cluster.local"
          recordType: "A"
      latencyThreshold: 100ms               # Low latency for DB connections
      failureCountThreshold: 1              # Alert immediately
      consistencyChecking:
        enabled: true
        queriesPerCheck: 5

Remediation Guidance

When DNS issues are detected, consider these remediation steps:

Condition	Cause	Remediation
`ClusterDNSDown`	CoreDNS pods unhealthy	Check `kubectl -n kube-system get pods -l k8s-app=kube-dns`
`DNSResolutionDegraded` (partial nameserver failure)	One upstream DNS server failing	Update `/etc/resolv.conf` or CoreDNS upstream servers
`DNSResolutionIntermittent`	Overloaded DNS servers	Increase CoreDNS replicas, enable DNS caching
NXDOMAIN errors	Missing DNS record or zone	Add record to DNS zone, check CoreDNS stub domains
High latency	Network congestion, distant DNS	Use local caching DNS, reduce TTL for faster updates
`DNSResolutionInconsistent` (varying IPs)	DNS load balancing, stale cache	Verify expected behavior, check TTL settings

CoreDNS Stub Domain Configuration:

If custom TLDs (.internal, .corp) are failing, configure CoreDNS stub domains:

apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
           pods insecure
           fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
    corp.internal:53 {
        errors
        cache 30
        forward . 10.0.0.53 10.0.0.54  # Internal DNS servers
    }

Temporary /etc/hosts Workaround:

For immediate mitigation while DNS is fixed:

apiVersion: v1
kind: Pod
spec:
  hostAliases:
    - ip: "10.0.1.100"
      hostnames:
        - "ldap.corp.internal"
    - ip: "10.0.1.101"
      hostnames:
        - "auth.corp.internal"

Integration with Kubernetes Events

DNS conditions appear as node conditions viewable via kubectl:

# View all DNS-related node conditions
kubectl describe node <node-name> | grep -E "NodeDoctorDNS|NodeDoctorCluster|NodeDoctorCustomDNS"

# Example output from a healthy node:
#   NodeDoctorDNSResolutionConsistent   True   ConsistentResolution     Domain google.com: all 5 queries succeeded with consistent IPs 142.251.32.46 (avg latency: 4.65ms)
#   NodeDoctorCustomDNSHealthy          True   AllNameserversHealthy    Domain cloudflare.com: all 1 nameservers responding
#   NodeDoctorClusterDNSHealthy         True   ClusterDNSResolved       Cluster DNS resolution is healthy
#   NodeDoctorDNSResolutionInconsistent True   InconsistentIPAddresses  Domain google.com: 3 unique IPs returned across 5 queries

# Check DNS conditions across all nodes
kubectl get nodes -o name | xargs -I {} sh -c 'echo "=== {} ===" && kubectl describe {} | grep -E "NodeDoctorDNS|NodeDoctorCluster|NodeDoctorCustomDNS"'

Alerting with Prometheus:

Node Doctor exports DNS metrics that can be used for alerting:

# Example Prometheus alert rules (using kube-state-metrics node conditions)
groups:
  - name: dns-alerts
    rules:
      - alert: DNSResolutionIntermittent
        expr: |
          kube_node_status_condition{condition="NodeDoctorDNSResolutionIntermittent", status="true"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intermittent DNS resolution on {{ $labels.node }}"
          description: "DNS resolution is intermittent, indicating upstream DNS issues"

      - alert: DNSResolutionDegraded
        expr: |
          kube_node_status_condition{condition="NodeDoctorDNSResolutionDegraded", status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Degraded DNS resolution on {{ $labels.node }}"
          description: "One or more DNS servers are failing for critical domains"

      - alert: ClusterDNSDown
        expr: |
          kube_node_status_condition{condition="NodeDoctorClusterDNSDown", status="true"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster DNS is down on {{ $labels.node }}"
          description: "Cluster DNS resolution has repeatedly failed"

      - alert: CustomDNSUnhealthy
        expr: |
          kube_node_status_condition{condition="NodeDoctorCustomDNSDown", status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Custom DNS domains failing on {{ $labels.node }}"
          description: "All nameservers are failing for custom domain queries"

Complete Testing Configuration:

This configuration enables all DNS monitoring features for comprehensive testing:

monitors:
  - name: dns-health
    type: network-dns-check
    enabled: true
    interval: 15s  # More frequent for testing
    timeout: 10s
    config:
      # Standard cluster DNS domains
      clusterDomains:
        - kubernetes.default.svc.cluster.local
        - kube-dns.kube-system.svc.cluster.local

      # External DNS domains
      externalDomains:
        - google.com
        - cloudflare.com

      # Custom queries with all features enabled
      customQueries:
        # Test per-nameserver checking
        - domain: "kubernetes.default.svc.cluster.local"
          recordType: "A"
          testEachNameserver: true

        # Test consistency checking
        - domain: "google.com"
          recordType: "A"
          consistencyCheck: true

        # Test both features together
        - domain: "cloudflare.com"
          recordType: "A"
          testEachNameserver: true
          consistencyCheck: true

      latencyThreshold: 1s
      nameserverCheckEnabled: true
      failureCountThreshold: 2

      # Success rate tracking - sliding window
      successRateTracking:
        enabled: true
        windowSize: 10
        failureRateThreshold: 20  # 20% failure rate threshold
        minSamplesRequired: 3

      # Consistency checking - rapid multi-query
      consistencyChecking:
        enabled: true
        queriesPerCheck: 5
        intervalBetweenQueries: 100ms

Expected Conditions When Testing:

Condition	Expected Value	Notes
`NodeDoctorClusterDNSHealthy`	True	Cluster DNS should resolve
`NodeDoctorCustomDNSHealthy`	True	Per-nameserver tests passing
`NodeDoctorDNSResolutionConsistent`	True	Queries returning consistent results
`NodeDoctorDNSResolutionInconsistent`	True (for google.com)	Expected - Google uses DNS round-robin, multiple IPs is normal

Note: NodeDoctorDNSResolutionInconsistent=True for domains like google.com is expected behavior. Large CDN/cloud providers use DNS load balancing which returns different IPs. This condition helps detect unexpected IP variation in domains where you expect a single IP.

SLO/SLA Tracking:

Use success rate tracking for DNS SLOs:

# Configuration for 99.9% DNS availability SLO
successRateTracking:
  enabled: true
  windowSize: 100                # Track 100 checks
  failureRateThreshold: 0.1      # 0.1% = 99.9% availability
  minSamplesRequired: 50         # Need 50 samples before alerting

Gateway Monitor

Monitors default gateway reachability via ICMP ping.

Monitor Type: network-gateway-check

Source File: pkg/monitors/network/gateway.go:300

Configuration:

monitors:
  - name: gateway-health
    type: network-gateway-check
    interval: 30s
    timeout: 5s
    config:
      autoDetectGateway: true       # Auto-detect from /proc/net/route
      manualGateway: ""             # Override with specific IP
      pingCount: 3
      pingTimeout: 1s
      latencyThreshold: 100ms
      failureCountThreshold: 3

Default Values:

autoDetectGateway: true
pingCount: 3
pingTimeout: 1 second
latencyThreshold: 100ms
failureCountThreshold: 3

Key Features:

Auto-detection of default gateway from /proc/net/route
ICMP echo request/reply (ping)
Average latency calculation
Packet loss detection
Manual gateway override option

Events Generated:

GatewayUnreachable (Error): Cannot reach default gateway
GatewayHighLatency (Warning): Gateway latency exceeds threshold
GatewayRecovered (Info): Gateway reachability restored

Conditions:

GatewayUnreachable: True when failure threshold exceeded

Example Status:

events:
  - severity: Error
    reason: GatewayUnreachable
    message: "Default gateway 192.168.1.1 is unreachable (0/3 packets received)"
  - severity: Warning
    reason: GatewayHighLatency
    message: "Gateway latency 150ms exceeds 100ms threshold"

Connectivity Monitor

Monitors HTTP/HTTPS endpoint connectivity for external services and APIs.

Monitor Type: network-connectivity-check

Source File: pkg/monitors/network/connectivity.go:300

Configuration:

monitors:
  - name: connectivity-health
    type: network-connectivity-check
    interval: 60s
    timeout: 30s
    config:
      endpoints:
        - url: https://kubernetes.default.svc.cluster.local
          name: kubernetes-api
          method: GET
          expectedStatusCode: 200
          timeout: 10s
          followRedirects: false
          headers:
            Authorization: "Bearer ${TOKEN}"
        - url: https://registry.example.com/v2/
          name: container-registry
          method: HEAD
          expectedStatusCode: 200

Default Values per Endpoint:

method: HEAD (safe, no response body)
expectedStatusCode: 200
timeout: 10 seconds
followRedirects: false

Key Features:

Multiple endpoint monitoring
HTTP method support: GET, HEAD, OPTIONS (POST/PUT/DELETE restricted for safety)
Custom headers (e.g., authentication tokens)
Expected status code validation
URL and protocol validation (http/https only)
Resource limits: maximum 50 endpoints

Events Generated:

EndpointUnreachable (Error): HTTP request failed
EndpointUnexpectedStatus (Warning): Status code mismatch
EndpointRecovered (Info): Endpoint reachable again

Security Considerations:

Only safe HTTP methods allowed (GET, HEAD, OPTIONS)
No POST/PUT/DELETE to prevent accidental data modification
URL validation prevents file:// and other protocols
Header sanitization in error messages

Example Status:

events:
  - severity: Error
    reason: EndpointUnreachable
    message: "Failed to reach https://registry.example.com: connection timeout"
  - severity: Warning
    reason: EndpointUnexpectedStatus
    message: "Endpoint kubernetes-api returned 401, expected 200"

CNI Monitor

Monitors CNI (Container Network Interface) health and cross-node network connectivity. This monitor is critical for detecting network partitions and ensuring nodes can communicate with each other in the cluster.

Monitor Type: network-cni-check

Source Files:

pkg/monitors/network/cni.go
pkg/monitors/network/cni_health.go
pkg/monitors/network/peer_discovery.go

Configuration:

monitors:
  - name: cni-health
    type: network-cni-check
    interval: 30s
    timeout: 15s
    config:
      discovery:
        method: kubernetes            # kubernetes or static
        namespace: node-doctor        # Namespace for peer discovery
        labelSelector: app=node-doctor
        refreshInterval: 5m           # How often to refresh peer list
        staticPeers: []               # For static method: list of IPs
      connectivity:
        pingCount: 3                  # Pings per peer
        pingTimeout: 5s               # Timeout per ping
        warningLatency: 50ms          # Latency warning threshold
        criticalLatency: 200ms        # Latency critical threshold
        failureThreshold: 3           # Consecutive failures before alert
        minReachablePeers: 80         # Minimum % of peers that must be reachable
      cniHealth:
        enabled: true                 # Enable CNI config/interface checks
        configPath: /etc/cni/net.d    # CNI configuration directory
        checkInterfaces: false        # Check for specific interfaces
        expectedInterfaces: []        # Expected interface names

Default Values:

Discovery:

method: kubernetes
namespace: node-doctor
labelSelector: app=node-doctor
refreshInterval: 5 minutes

Connectivity:

pingCount: 3
pingTimeout: 5 seconds
warningLatency: 50ms
criticalLatency: 200ms
failureThreshold: 3
minReachablePeers: 80%

CNI Health:

enabled: true
configPath: /etc/cni/net.d
checkInterfaces: false

Key Features:

Peer Discovery
- Kubernetes API-based discovery of other node-doctor instances
- Automatic exclusion of self (current node)
- Background refresh of peer list
- Static peer configuration for non-Kubernetes environments
Cross-Node Connectivity
- ICMP ping mesh to all discovered peers
- Latency measurement with configurable thresholds
- Network partition detection when peer reachability drops below threshold
- Per-peer failure tracking with consecutive failure counts
CNI Health Validation
- CNI configuration file detection and validation
- Support for multiple CNI plugins (Calico, Flannel, Weave, Cilium, etc.)
- Network interface discovery and health checking
- Expected interface validation
Network Partition Detection
- Reports NetworkPartitioned condition when insufficient peers reachable
- Configurable minimum reachable percentage threshold
- Automatic recovery detection

Discovery Methods:

Kubernetes Discovery (default)
- Uses Kubernetes API to list pods matching label selector
- Requires RBAC permissions for pod list/watch
- Automatically discovers node-doctor instances on other nodes
- Uses host network IPs for connectivity testing
Static Discovery
- Manual list of peer IP addresses
- Useful for testing or non-Kubernetes environments
- No Kubernetes API dependency

CNI Plugin Detection:

The monitor automatically detects common CNI plugins:

Calico: calico-* interfaces, 10-calico.conflist
Flannel: flannel.1 interface, 10-flannel.conflist
Weave: weave* interfaces
Cilium: cilium_host, cilium_net, lxc* interfaces
Canal: Combined Calico/Flannel

Events Generated:

NoPeersFound (Warning): No peer instances discovered
PeerUnreachable (Error): Peer node unreachable after failure threshold
HighPeerLatency (Warning): Latency to peer exceeds threshold
CNIConnectivitySummary (Info): Summary of peer connectivity
CNIConfigError (Error): CNI configuration validation failed
CNIInterfaceWarning (Warning): Expected interfaces not found

Conditions:

NetworkPartitioned: True when insufficient peers reachable
NetworkDegraded: True when high latency detected to peers
CNIHealthy: Overall CNI plugin health status
CNIConfigValid: CNI configuration file validity
CNIInterfacesHealthy: Expected network interfaces present

Example Status:

events:
  - severity: Info
    reason: CNIConnectivitySummary
    message: "Peer connectivity: 4/5 reachable (80%), avg latency=12.50ms"
  - severity: Warning
    reason: HighPeerLatency
    message: "High latency to peer node-3: 250.00ms (critical threshold: 200.00ms)"
conditions:
  - type: NetworkPartitioned
    status: "False"
    reason: SufficientPeerReachability
    message: "80% of peers are reachable (4/5)"
  - type: NetworkDegraded
    status: "True"
    reason: HighLatencyDetected
    message: "High latency detected to 1 peers: [node-3 (250.00ms)]"
  - type: CNIHealthy
    status: "True"
    reason: CNIOperational
    message: "CNI plugin configuration and interfaces are healthy"

Network Partition Example:

events:
  - severity: Error
    reason: PeerUnreachable
    message: "Peer node-2 has been unreachable for 3 consecutive checks"
  - severity: Error
    reason: PeerUnreachable
    message: "Peer node-3 has been unreachable for 5 consecutive checks"
conditions:
  - type: NetworkPartitioned
    status: "True"
    reason: InsufficientPeerReachability
    message: "Only 40% of peers are reachable (threshold: 80%). Unreachable: [node-2, node-3]"

RBAC Requirements:

The CNI monitor requires pod list permissions for Kubernetes-based peer discovery:

rules:
- apiGroups: [""]
  resources: ["pods"]
  verbs: ["get", "list", "watch"]

Architecture Notes:

Since node-doctor runs with hostNetwork: true, the CNI monitor:

Uses node IPs (not pod IPs) for connectivity testing
Can detect node-level network issues that affect CNI
Works even if CNI is completely broken (since it bypasses pod network)
Tests actual node-to-node connectivity that CNI relies on

Use Cases:

Network Partition Detection: Detect when a node becomes isolated from the cluster
CNI Health Monitoring: Validate CNI plugin is properly configured
Network Latency Alerting: Identify network performance degradation
Cross-Node Connectivity: Ensure nodes can communicate for pod-to-pod traffic
Node Health Correlation: A node that can't reach other nodes should be considered unhealthy

Kubernetes Monitors

Kubelet Monitor

Monitors Kubelet health, metrics, and PLEG (Pod Lifecycle Event Generator) performance.

Monitor Type: kubernetes-kubelet-check

Source File: pkg/monitors/kubernetes/kubelet.go:350

Configuration:

monitors:
  - name: kubelet-health
    type: kubernetes-kubelet-check
    interval: 30s
    timeout: 10s
    config:
      healthzURL: http://127.0.0.1:10248/healthz
      metricsURL: http://127.0.0.1:10250/metrics
      checkPLEG: true
      plegThreshold: 5s             # PLEG relist duration threshold
      failureThreshold: 3
      auth:
        type: serviceaccount        # none, serviceaccount, bearer, certificate
        tokenPath: /var/run/secrets/kubernetes.io/serviceaccount/token
        certPath: /path/to/client.crt
        keyPath: /path/to/client.key
        caPath: /path/to/ca.crt
      circuitBreaker:
        enabled: true
        threshold: 5                # Open after 5 consecutive failures
        timeout: 30s                # Half-open after 30s
        maxHalfOpenRequests: 3

Default Values:

healthzURL: http://127.0.0.1:10248/healthz
metricsURL: http://127.0.0.1:10250/metrics
checkPLEG: true
plegThreshold: 5 seconds
failureThreshold: 3
auth.type: none

Authentication Types:

none: No authentication (healthz endpoint)
serviceaccount: Use mounted ServiceAccount token
bearer: Custom bearer token
certificate: mTLS with client certificate

Circuit Breaker Pattern:

Closed: Normal operation, requests flow through
Open: Too many failures, fail fast without requests
Half-Open: Testing if service recovered, limited requests

Key Features:

Kubelet /healthz endpoint monitoring
Metrics endpoint authentication support
PLEG relist duration monitoring (Kubernetes API responsiveness)
Circuit breaker for fault protection
Multiple authentication methods
TLS verification with custom CA

Events Generated:

KubeletUnhealthy (Error): Kubelet healthz check failed
KubeletAuthFailure (Error): Authentication to metrics endpoint failed
PLEGSlow (Warning): PLEG relist duration exceeds threshold
KubeletRecovered (Info): Kubelet health restored

Conditions:

KubeletUnhealthy: True when failure threshold exceeded

Example Status:

events:
  - severity: Warning
    reason: PLEGSlow
    message: "PLEG relist duration 7.2s exceeds 5s threshold"
conditions:
  - type: KubeletUnhealthy
    status: "True"
    reason: HealthzFailed
    message: "Kubelet healthz endpoint failed 3 consecutive checks"

API Server Monitor

Monitors Kubernetes API server connectivity, latency, and authentication.

Monitor Type: kubernetes-apiserver-check

Source File: pkg/monitors/kubernetes/apiserver.go:514

Configuration:

monitors:
  - name: apiserver-health
    type: kubernetes-apiserver-check
    interval: 30s
    timeout: 15s
    config:
      endpoint: https://kubernetes.default.svc.cluster.local
      latencyThreshold: 2s
      checkVersion: true
      checkAuth: true
      failureThreshold: 3
      httpTimeout: 10s

Default Values:

endpoint: https://kubernetes.default.svc.cluster.local (in-cluster)
latencyThreshold: 2 seconds
checkVersion: true
checkAuth: true
failureThreshold: 3
httpTimeout: 10 seconds

Key Features:

Uses Kubernetes client-go for API interaction
In-cluster ServiceAccount authentication (automatic)
GET /version endpoint for lightweight health checking
Latency measurement and threshold alerting
Authentication failure detection (401/403)
Rate limiting detection (429)
Error sanitization (prevents token leakage in logs)

Events Generated:

APIServerUnreachable (Error): Cannot reach API server
APIServerSlow (Warning): API latency exceeds threshold
APIServerAuthFailure (Error): Authentication failed
APIServerRateLimited (Warning): Rate limit detected
APIServerRecovered (Info): API server reachable again

Conditions:

APIServerUnreachable: True when failure threshold exceeded

Security:

Error message sanitization prevents sensitive data leakage
No authentication tokens in logs or events
Generic error messages for auth/TLS failures

Example Status:

events:
  - severity: Error
    reason: APIServerUnreachable
    message: "API server unreachable after 3 consecutive failures: connection refused"
  - severity: Warning
    reason: APIServerSlow
    message: "API server latency 3.5s exceeds threshold 2.0s"

Runtime Monitor

Monitors container runtime health (Docker, containerd, CRI-O).

Monitor Type: kubernetes-runtime-check

Source File: pkg/monitors/kubernetes/runtime.go:618

Configuration:

monitors:
  - name: runtime-health
    type: kubernetes-runtime-check
    interval: 30s
    timeout: 10s
    config:
      runtimeType: auto             # auto, docker, containerd, crio
      dockerSocket: /var/run/docker.sock
      containerdSocket: /run/containerd/containerd.sock
      crioSocket: /var/run/crio/crio.sock
      checkSocketConnectivity: true
      checkSystemdStatus: true
      checkRuntimeInfo: true
      failureThreshold: 3
      timeout: 5s

Default Values:

runtimeType: auto (auto-detect)
Socket paths: Standard locations for each runtime
checkSocketConnectivity: true
checkSystemdStatus: true
checkRuntimeInfo: true
failureThreshold: 3
timeout: 5 seconds

Runtime Auto-Detection:

Check for Docker socket at /var/run/docker.sock
Check for containerd socket at /run/containerd/containerd.sock
Check for CRI-O socket at /var/run/crio/crio.sock
Use first detected runtime

Check Types:

Socket Connectivity: Unix socket connection test
Systemd Status: systemctl is-active <service> check
Runtime Info: Basic API connectivity verification

Key Features:

Multi-runtime support with auto-detection
Socket accessibility testing
Systemd service health checking
Systemd state awareness (active, inactive, failed, activating, etc.)
Custom socket path override
Graceful handling of missing runtimes

Events Generated:

RuntimeSocketUnreachable (Warning): Cannot connect to runtime socket
RuntimeSystemdInactive (Warning): Systemd service not active
RuntimeInfoFailed (Warning): Failed to retrieve runtime info
RuntimeHealthy (Info): All runtime checks passed
RuntimeRecovered (Info): Runtime health restored

Conditions:

ContainerRuntimeUnhealthy: True when failure threshold exceeded

Example Status:

events:
  - severity: Warning
    reason: RuntimeSystemdInactive
    message: "Container runtime systemd service (docker) is not active"
conditions:
  - type: ContainerRuntimeUnhealthy
    status: "True"
    reason: HealthCheckFailed
    message: "Container runtime (docker) has failed health checks for 3 consecutive attempts"

Capacity Monitor

Monitors pod capacity on Kubernetes nodes and alerts when approaching limits.

Monitor Type: kubernetes-capacity-check

Source File: pkg/monitors/kubernetes/capacity.go:512

Configuration:

monitors:
  - name: capacity-health
    type: kubernetes-capacity-check
    interval: 60s
    timeout: 15s
    config:
      nodeName: ""                  # Auto-detected from NODE_NAME env var
      warningThreshold: 90          # 90% capacity
      criticalThreshold: 95         # 95% capacity
      failureThreshold: 3
      apiTimeout: 10s
      checkAllocatable: true        # Use allocatable vs capacity

Default Values:

nodeName: Auto-detected from NODE_NAME environment variable
warningThreshold: 90%
criticalThreshold: 95%
failureThreshold: 3
apiTimeout: 10 seconds
checkAllocatable: true

Note: Threshold values (warningThreshold, criticalThreshold) are automatically validated to be in the range 0-100. See Configuration Guide for details.

Allocatable vs Capacity:

Allocatable: Pods actually schedulable (capacity minus system reservations)
Capacity: Total pod slots on the node

Key Features:

Kubernetes API integration for pod counting
Only counts pods in Running phase
Node name auto-detection from environment
Separate warning and critical thresholds
State transition tracking (normal → warning → critical)
Recovery event generation
In-cluster authentication via ServiceAccount

Events Generated:

PodCapacityWarning (Warning): Capacity 90-94%
PodCapacityPressure (Error): Capacity ≥95%
PodCapacityRecovered (Info): Capacity returned to normal
CapacityCheckFailed (Error): Failed to query capacity

Conditions:

PodCapacityPressure: True when critical threshold exceeded
PodCapacityUnhealthy: True when repeated check failures

Example Status:

events:
  - severity: Error
    reason: PodCapacityPressure
    message: "Node pod capacity at 96.5% (110/114 pods)"
conditions:
  - type: PodCapacityPressure
    status: "True"
    reason: HighPodUtilization
    message: "Pod capacity at 96.5% (110/114), exceeds 95% threshold"

Custom Monitors

Plugin Monitor

Executes custom external plugins for health checking with JSON or simple output formats.

Monitor Type: custom-plugin-check

Source File: pkg/monitors/custom/plugin.go:300

Configuration:

monitors:
  - name: custom-gpu-health
    type: custom-plugin-check
    interval: 60s
    timeout: 30s
    config:
      pluginPath: /usr/local/bin/check-gpu-health
      args:
        - --verbose
        - --threshold=80
      outputFormat: json            # json or simple
      failureThreshold: 3
      apiTimeout: 10s
      env:
        GPU_DEVICE: "0"
        LOG_LEVEL: "info"

Default Values:

outputFormat: json
failureThreshold: 3
apiTimeout: 10 seconds

Output Formats:

JSON Output:

{
  "status": "healthy",
  "message": "GPU utilization at 45%",
  "events": [
    {
      "severity": "info",
      "reason": "GPUNormal",
      "message": "GPU temperature: 65C"
    }
  ]
}

Valid status values: healthy, warning, critical, unknown

Simple Output:

OK: GPU utilization at 45%

Exit codes:

0: Healthy
1: Warning
2: Critical
Other: Unknown/Error

Key Features:

External plugin execution with timeout
JSON and simple (Nagios-style) output parsing
Custom environment variable support
Command-line argument passing
Failure threshold tracking
Plugin state validation
Error handling and recovery

Events Generated:

Plugin-defined events (from JSON output)
PluginCheckFailed (Error): Plugin execution failed
PluginUnknownState (Warning): Plugin returned unknown status

Conditions:

PluginUnhealthy: True when plugin reports critical or repeated failures

Example Plugin (Bash):

#!/bin/bash
# GPU Health Check Plugin

gpu_temp=$(nvidia-smi --query-gpu=temperature.gpu --format=csv,noheader)

if [ "$gpu_temp" -lt 80 ]; then
    status="healthy"
    severity="info"
elif [ "$gpu_temp" -lt 90 ]; then
    status="warning"
    severity="warning"
else
    status="critical"
    severity="error"
fi

cat <<EOF
{
  "status": "$status",
  "message": "GPU temperature: ${gpu_temp}C",
  "events": [
    {
      "severity": "$severity",
      "reason": "GPUTemperature",
      "message": "Current GPU temperature: ${gpu_temp}C"
    }
  ]
}
EOF

Log Pattern Monitor

Monitors system logs for specific patterns with ReDoS protection and deduplication.

Monitor Type: custom-logpattern-check

Source File: pkg/monitors/custom/logpattern.go:350

Configuration:

monitors:
  - name: log-pattern-health
    type: custom-logpattern-check
    interval: 30s
    timeout: 10s
    config:
      useDefaults: true             # Include default critical patterns
      patterns:
        - pattern: 'kernel: Out of memory'
          severity: error
          reason: OOMDetected
          message: "Out of memory condition detected in kernel logs"
        - pattern: 'segmentation fault'
          severity: warning
          reason: SegFault
          message: "Segmentation fault detected"
        - pattern: 'DENIED'
          severity: info
          reason: SELinuxDenial
          message: "SELinux denial detected"
      # Kernel log monitoring (choose one):
      checkKernelJournal: true      # PRIMARY: Use journalctl -k (recommended)
      checkKmsg: false              # FALLBACK: Direct /dev/kmsg access
      kmsgPath: /dev/kmsg           # Path for checkKmsg fallback
      # Service unit log monitoring:
      checkJournal: true            # Enable systemd journal monitoring
      journalUnits:
        - kubelet
        - docker
        - containerd
      maxEventsPerPattern: 10       # Max events per pattern per check (1-1000)
      dedupWindow: 5m               # Deduplication window (1s-1h)

Default Values:

useDefaults: true
checkKernelJournal: true (PRIMARY - uses journalctl -k)
checkKmsg: false (FALLBACK - direct /dev/kmsg access)
kmsgPath: /dev/kmsg
checkJournal: true (service unit logs)
maxEventsPerPattern: 10 (range: 1-1000)
dedupWindow: 5 minutes (range: 1s-1h)

Kernel Log Monitoring Methods:

Method	Config Option	Command	Use Case
Journal (Primary)	`checkKernelJournal: true`	`journalctl -k --since`	Recommended - uses systemd journal, supports time-based filtering
Kmsg (Fallback)	`checkKmsg: true`	Read `/dev/kmsg`	Non-systemd systems or when journalctl unavailable

Note: The container image includes the journalctl binary from the systemd package to support kernel journal monitoring. If both checkKernelJournal and checkKmsg are enabled, kernel journal takes precedence.

Default Patterns (when useDefaults=true):

System & Hardware:

OOM kills: killed process|Out of memory|oom-kill
Kernel panics: Kernel panic|BUG: unable to handle
Hardware errors: Machine check events|Hardware Error
Filesystem errors: EXT4-fs error|XFS.*error|I/O error
Storage soft lockup: soft lockup.*(?:longhorn|mpt|scsi|iscsi|nvme|nfs) (error)

VMware:

vmxnet3 TX hang: vmxnet3.*tx hang (error)
vmxnet3 NIC reset: vmxnet3.*resetting (warning)
NSX errors: nsx.*(?:error|failed|timeout) (error)

Networking - Conntrack/Netfilter:

Conntrack table full: nf_conntrack.*table full (error)
Conntrack dropping: nf_conntrack.*dropping packet (error)
Netfilter error: (?:nf_tables|nftables|netfilter).*error (error)
iptables error: iptables.*(?:error|failed|invalid argument) (error)
iptables sync failed: (?:iptables|kube-proxy).*sync.*failed (error)

Networking - NIC/Driver:

NIC link down: (?:e1000e|igb|ixgbe|mlx|bnxt|r8169).*Link is Down (warning)
NIC driver error: (?:e1000|igb|ixgbe|mlx[45]|bnxt|i40e).*error (error)
NIC TX timeout: NETDEV WATCHDOG.*transmit.*timed out (error)
NIC firmware error: (?:firmware|nvram).*failed|Unable to load firmware (error)
Carrier lost: carrier (?:lost|off) (warning)

Networking - Network Stack:

Socket buffer overrun: (?:socket buffer.*overrun|packets.*pruned.*socket|RcvbufErrors) (warning)
TCP retransmit error: TCP.*retransmit.*timeout|tcp_retries.*exceeded (warning)
ARP resolution failed: (?:ARP.*failed|no ARP.*reply|neighbor.*FAILED) (warning)
Route error: (?:RTNETLINK.*error|route.*failed|no route to host) (error)

Networking - CNI:

Calico error: (?:calico|felix).*error (error)
Flannel error: flannel.*(?:error|panic|failed) (error)
Cilium error: cilium-agent.*(?:error|panic)|BPF.*load.*failed (error)
CNI plugin failed: CNI plugin.*(?:error|failed|timeout) (error)

Networking - kube-proxy/IPVS:

IPVS sync error: ipvs.*(?:error|failed|sync failed) (error)
kube-proxy error: kube-proxy.*(?:error|failed) (error)
Endpoint sync failed: (?:endpoint.*syncing|UpdateEndpoints).*failed (warning)

Networking - Pod Networking:

veth error: veth.*(?:error|failed|cannot create) (error)
Network namespace error: (?:netns|network namespace).*(?:error|failed) (error)
Pod network setup failed: failed to (?:set up|setup).*network (error)

Cloud Provider:

AWS ENI error: (?:eni|ENI|vpc-cni).*(?:error|failed) (error)
Azure network error: azure.*(?:cni|network).*(?:error|failed) (error)

Resource Limits:

Maximum 60 patterns
Maximum 20 journal units
Pattern complexity scoring to prevent ReDoS
Deduplication window: 1 second to 1 hour

ReDoS Protection:

The monitor validates regex patterns for safety:

Complexity Scoring: Assigns penalty points for risky patterns
- Nested quantifiers: (a+)+ (high risk)
- Overlapping alternations: (a|ab)* (moderate risk)
- Greedy quantifiers: .* (low risk)
Threshold Rejection: Patterns exceeding complexity score rejected
Timeout Enforcement: Context-based timeout for regex matching

Key Features:

Kernel journal monitoring via journalctl -k (primary, recommended)
Kernel message monitoring via /dev/kmsg (fallback)
Systemd service unit journal monitoring (kubelet, containerd, docker)
Regex pattern matching with safety validation
Deduplication to prevent event flooding
Custom pattern support
Default critical pattern library
Event rate limiting per pattern
Time-based filtering (only processes new logs since last check)
ARM64 compatibility

Events Generated:

Pattern-defined events (custom severity/reason/message)
LogPatternCheckFailed (Error): Failed to read logs

Example Status:

events:
  - severity: Error
    reason: OOMDetected
    message: "Out of memory condition detected in kernel logs"
  - severity: Warning
    reason: SegFault
    message: "Segmentation fault detected (2 occurrences in last 5m)"

VMware vmxnet3 TX Hang Detection:

The vmxnet3 patterns detect VMware virtual NIC transmit hangs that can cause cascade failures in storage systems, particularly in Longhorn environments. When a vmxnet3 TX hang occurs:

TX Hang (vmxnet3-tx-hang): The virtual NIC's transmit queue stalls
NIC Reset (vmxnet3-nic-reset): VMware attempts automatic recovery
Storage Impact: iSCSI/Longhorn connections may timeout during the outage
Soft Lockup (soft-lockup-storage): CPU may appear stuck in storage I/O

Cascade Timeline Example:

02:02:36 vmxnet3 0000:03:00.0 ens160: tx hang           # TX hang starts
02:02:36 vmxnet3 0000:03:00.0 ens160: resetting         # NIC reset begins
02:02:39 vmxnet3 0000:03:00.0: intr vectors allocated   # Recovery complete
02:07:03 soft lockup - CPU#2 stuck [longhorn-instan]    # Storage impact

Recommended Response:

Check ESXi host logs for underlying cause (vMotion, resource contention)
Review Longhorn/storage replica health
Consider increasing Longhorn timeouts in high-vmxnet3-hang environments
This is a VMware virtualization layer issue, not a guest OS or Kubernetes problem

Troubleshooting

Common Issues

1. Monitor Not Starting

Symptom: Monitor appears in config but doesn't generate status updates

Possible Causes:

Invalid configuration (check validation errors in logs)
Timeout exceeds interval
Missing dependencies (e.g., Kubernetes client for API server monitor)

Solution:

# Check logs for validation errors
journalctl -u node-doctor -f | grep -i error

# Verify configuration
node-doctor validate --config /etc/node-doctor/config.yaml

# Ensure timeout < interval
# timeout: 10s
# interval: 30s  # Good: interval > timeout

2. Permission Denied Errors

Symptom: Monitor fails with "permission denied" errors

Common Locations:

/dev/kmsg: Requires CAP_SYSLOG or root
/proc/diskstats: Requires read access
Container runtime sockets: Requires socket access

Solution:

# DaemonSet securityContext
securityContext:
  privileged: true  # Or specific capabilities
  capabilities:
    add:
      - SYS_ADMIN
      - SYS_RESOURCE

3. High Memory Usage

Symptom: Node Doctor consuming excessive memory

Possible Causes:

Too many log pattern monitors
Large deduplication windows
Too many custom patterns

Solution:

# Reduce deduplication window
dedupWindow: 1m  # Instead of 1h

# Limit patterns
maxEventsPerPattern: 5  # Instead of 100

# Reduce check frequency
interval: 60s  # Instead of 10s

4. Missing Events

Symptom: Expected events not appearing

Troubleshooting Steps:

Check Monitor Status:

# View monitor list
kubectl exec -it <pod> -- node-doctor monitors list

# Check specific monitor
kubectl exec -it <pod> -- node-doctor monitors status <monitor-name>

Verify Thresholds:

# Lower thresholds for testing
warningThreshold: 50  # Instead of 85
criticalThreshold: 75  # Instead of 95

Check Failure Threshold:

# Reduce to see events sooner
failureThreshold: 1  # Instead of 3

Enable Debug Logging:

# In config.yaml
logging:
  level: debug

5. Certificate/TLS Errors

Symptom: TLS handshake failures, certificate verification errors

Kubelet Monitor:

config:
  auth:
    type: certificate
    certPath: /path/to/client.crt
    keyPath: /path/to/client.key
    caPath: /path/to/ca.crt  # Ensure CA matches server cert

API Server Monitor:

# Verify ServiceAccount token is mounted
volumeMounts:
  - name: serviceaccount
    mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    readOnly: true

6. ARM64 Specific Issues

Symptom: /dev/kmsg EINVAL errors on ARM64 nodes

Solution: This is expected behavior. The Memory Monitor handles this gracefully:

# OOM kill detection automatically disabled on ARM64 if /dev/kmsg fails
checkOOMKills: true  # Will gracefully skip on ARM64

Creating Custom Monitors

Step 1: Define Monitor Structure

Create a new package under pkg/monitors/:

package custom

import (
    "context"
    "github.com/supporttools/node-doctor/pkg/monitors"
    "github.com/supporttools/node-doctor/pkg/types"
)

type CustomMonitorConfig struct {
    // Your configuration fields
    Threshold float64 `json:"threshold"`
}

type CustomMonitor struct {
    *monitors.BaseMonitor
    config *CustomMonitorConfig
}

Step 2: Implement Factory Function

func NewCustomMonitor(ctx context.Context, config types.MonitorConfig) (types.Monitor, error) {
    // Validate configuration
    if err := ValidateCustomConfig(config); err != nil {
        return nil, fmt.Errorf("invalid configuration: %w", err)
    }

    // Parse custom configuration
    customConfig, err := parseCustomConfig(config.Config)
    if err != nil {
        return nil, fmt.Errorf("failed to parse config: %w", err)
    }

    // Create base monitor
    baseMonitor, err := monitors.NewBaseMonitor(config.Name, config.Interval, config.Timeout)
    if err != nil {
        return nil, fmt.Errorf("failed to create base monitor: %w", err)
    }

    // Create custom monitor
    monitor := &CustomMonitor{
        BaseMonitor: baseMonitor,
        config:      customConfig,
    }

    // Set check function
    if err := baseMonitor.SetCheckFunc(monitor.check); err != nil {
        return nil, fmt.Errorf("failed to set check function: %w", err)
    }

    return monitor, nil
}

Step 3: Implement Check Function

func (m *CustomMonitor) check(ctx context.Context) (*types.Status, error) {
    status := types.NewStatus(m.GetName())

    // Perform your health check logic here
    value := m.doHealthCheck(ctx)

    // Evaluate against threshold
    if value > m.config.Threshold {
        status.AddEvent(types.NewEvent(
            types.EventWarning,
            "ThresholdExceeded",
            fmt.Sprintf("Value %.2f exceeds threshold %.2f", value, m.config.Threshold),
        ))

        status.AddCondition(types.NewCondition(
            "Unhealthy",
            types.ConditionTrue,
            "ThresholdExceeded",
            fmt.Sprintf("Custom metric exceeded threshold"),
        ))
    }

    return status, nil
}

Step 4: Register Monitor

func init() {
    monitors.Register(monitors.MonitorInfo{
        Type:        "custom-mycheck",
        Factory:     NewCustomMonitor,
        Validator:   ValidateCustomConfig,
        Description: "Monitors custom metric",
    })
}

Step 5: Add Configuration Parsing

func parseCustomConfig(configMap map[string]interface{}) (*CustomMonitorConfig, error) {
    config := &CustomMonitorConfig{}

    if val, ok := configMap["threshold"]; ok {
        switch v := val.(type) {
        case float64:
            config.Threshold = v
        case int:
            config.Threshold = float64(v)
        default:
            return nil, fmt.Errorf("threshold must be a number")
        }
    } else {
        config.Threshold = 80.0 // Default
    }

    return config, nil
}

Step 6: Add Validation

func ValidateCustomConfig(config types.MonitorConfig) error {
    if config.Name == "" {
        return fmt.Errorf("monitor name is required")
    }

    if config.Type != "custom-mycheck" {
        return fmt.Errorf("invalid monitor type: %s", config.Type)
    }

    // Validate custom fields
    customConfig, err := parseCustomConfig(config.Config)
    if err != nil {
        return err
    }

    if customConfig.Threshold < 0 || customConfig.Threshold > 100 {
        return fmt.Errorf("threshold must be between 0 and 100")
    }

    return nil
}

Best Practices

Thread Safety: Use mutexes for shared state
Context Awareness: Respect context cancellation
Error Handling: Return meaningful errors, don't panic
Logging: Use structured logging for debugging
Testing: Write unit tests with mock dependencies
Documentation: Document configuration fields and defaults
Validation: Validate configuration early (fail fast)
Resource Cleanup: Clean up resources in stop logic
Failure Tracking: Implement consecutive failure thresholds
Recovery Events: Report when issues resolve

Testing Your Monitor

func TestCustomMonitor(t *testing.T) {
    config := types.MonitorConfig{
        Name:     "test-custom",
        Type:     "custom-mycheck",
        Interval: 30 * time.Second,
        Timeout:  5 * time.Second,
        Config: map[string]interface{}{
            "threshold": 75.0,
        },
    }

    monitor, err := NewCustomMonitor(context.Background(), config)
    if err != nil {
        t.Fatalf("Failed to create monitor: %v", err)
    }

    // Start monitor
    statusChan := monitor.Start()

    // Wait for first status
    select {
    case status := <-statusChan:
        // Verify status
        if status.Source != "test-custom" {
            t.Errorf("Expected source 'test-custom', got '%s'", status.Source)
        }
    case <-time.After(10 * time.Second):
        t.Fatal("Timeout waiting for status")
    }

    // Stop monitor
    monitor.Stop()
}

Summary

Node Doctor provides a comprehensive monitoring framework with:

12 Built-in Monitors covering system, network, Kubernetes, and custom checks
Pluggable Architecture for easy extension
Thread-Safe Design for production reliability
Failure Threshold Tracking to prevent false positives
Recovery Detection for automatic issue resolution reporting
Flexible Configuration via YAML with sensible defaults

For additional support, see:

FilesExpand file tree

monitors.md

Latest commit

History

monitors.md

File metadata and controls

Node Doctor Monitors Documentation

Table of Contents

Overview

Architecture

BaseMonitor Pattern

Monitor Registry

Status Reporting

System Monitors

CPU Monitor

Memory Monitor

Disk Monitor

Network Monitors

DNS Monitor

Basic Configuration

Enhanced Features

1. Per-Nameserver Domain Testing

2. Success Rate Tracking (Sliding Window)

3. DNS Error Type Classification

4. Consistency Checking (Rapid Multi-Query)

Complete Configuration Reference

Events Generated

Conditions

Example: Enterprise LDAP DNS Monitoring

Example Status Output

Best Practices for DNS Monitoring

Identifying Critical DNS Dependencies

Common DNS Issues in Kubernetes Clusters

Configuration Examples by Use Case

Remediation Guidance

Integration with Kubernetes Events

Gateway Monitor

Connectivity Monitor

CNI Monitor

Kubernetes Monitors

Kubelet Monitor

API Server Monitor

Runtime Monitor

Capacity Monitor

Custom Monitors

Plugin Monitor

Log Pattern Monitor

Troubleshooting

Common Issues

1. Monitor Not Starting

2. Permission Denied Errors

3. High Memory Usage

4. Missing Events

5. Certificate/TLS Errors

6. ARM64 Specific Issues

Creating Custom Monitors

Step 1: Define Monitor Structure

Step 2: Implement Factory Function

Step 3: Implement Check Function

Step 4: Register Monitor

Step 5: Add Configuration Parsing

Step 6: Add Validation

Best Practices

Testing Your Monitor

Summary