This document provides comprehensive information about all monitor types available in Node Doctor, their configuration options, and how to create custom monitors.
- Overview
- Architecture
- System Monitors
- Network Monitors
- Kubernetes Monitors
- Custom Monitors
- Troubleshooting
- Creating Custom Monitors
Node Doctor provides a comprehensive set of health monitors for Kubernetes nodes. Each monitor is designed to detect specific types of issues and report them through a unified status reporting system.
Key Features:
- Pluggable Architecture: Register monitors via the factory pattern
- Thread-Safe: All monitors use mutex protection for concurrent operations
- Context-Based Timeouts: Proper cancellation and timeout enforcement
- Failure Threshold Tracking: Prevent false positives from transient failures
- Recovery Detection: Automatically report when issues are resolved
- Configurable: Fine-tune behavior through YAML configuration
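Failure threshold tracking and recovery detection can be sketched as a small state machine. The sketch below is illustrative only: thresholdTracker and observe are hypothetical names, not Node Doctor's actual types, and locking is left out for brevity.

```go
package main

import "fmt"

// thresholdTracker is a hypothetical helper, not Node Doctor's real type.
type thresholdTracker struct {
	threshold   int  // consecutive failures required before alerting
	consecutive int  // current run of consecutive failures
	alerting    bool // true once the threshold has been reached
}

// observe records one check result. It returns "alert" when the threshold is
// first reached, "recovered" when a success follows an alert, and "" otherwise.
func (t *thresholdTracker) observe(ok bool) string {
	if ok {
		t.consecutive = 0
		if t.alerting {
			t.alerting = false
			return "recovered"
		}
		return ""
	}
	t.consecutive++
	if !t.alerting && t.consecutive >= t.threshold {
		t.alerting = true
		return "alert"
	}
	return ""
}

func main() {
	t := &thresholdTracker{threshold: 3}
	for i, ok := range []bool{false, true, false, false, false, true} {
		fmt.Printf("check %d ok=%v -> %q\n", i+1, ok, t.observe(ok))
	}
}
```

A single transient failure resets on the next success and never alerts, which is the false-positive protection described above.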
All monitors extend BaseMonitor which provides:
- Lifecycle management (Start/Stop)
- Periodic check scheduling
- Status channel management
- Graceful shutdown handling
type BaseMonitor struct {
	name       string
	interval   time.Duration
	timeout    time.Duration
	statusChan chan *types.Status
	checkFunc  CheckFunc
	logger     Logger
}

Monitors self-register at initialization using the registry pattern:
func init() {
	monitors.Register(monitors.MonitorInfo{
		Type:        "system-cpu-check",
		Factory:     NewCPUMonitor,
		Validator:   ValidateCPUConfig,
		Description: "Monitors CPU load average and thermal throttling",
	})
}

Monitors report health through types.Status:
- Events: Point-in-time occurrences (Info, Warning, Error)
- Conditions: Persistent states (True/False with Reason and Message)
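The self-registration pattern described above can be approximated with a mutex-protected map keyed by monitor type. This is a minimal sketch under stated assumptions: the Monitor interface, the factory signature, and the Create helper are simplified stand-ins and may differ from the real monitors package.

```go
package main

import (
	"fmt"
	"sync"
)

// Monitor and MonitorInfo are simplified, hypothetical stand-ins.
type Monitor interface{ Name() string }

type MonitorInfo struct {
	Type        string
	Factory     func(name string) (Monitor, error)
	Description string
}

var (
	mu       sync.RWMutex
	registry = map[string]MonitorInfo{}
)

// Register adds a monitor type and rejects incomplete or duplicate entries.
func Register(info MonitorInfo) error {
	mu.Lock()
	defer mu.Unlock()
	if info.Type == "" || info.Factory == nil {
		return fmt.Errorf("monitor type and factory are required")
	}
	if _, exists := registry[info.Type]; exists {
		return fmt.Errorf("monitor type %q already registered", info.Type)
	}
	registry[info.Type] = info
	return nil
}

// Create looks up a registered type and builds a monitor instance.
func Create(monitorType, name string) (Monitor, error) {
	mu.RLock()
	info, ok := registry[monitorType]
	mu.RUnlock()
	if !ok {
		return nil, fmt.Errorf("unknown monitor type %q", monitorType)
	}
	return info.Factory(name)
}

type cpuMonitor struct{ name string }

func (c cpuMonitor) Name() string { return c.name }

func init() {
	// Registration happens at package initialization, before main runs.
	err := Register(MonitorInfo{
		Type:        "system-cpu-check",
		Factory:     func(name string) (Monitor, error) { return cpuMonitor{name: name}, nil },
		Description: "Monitors CPU load average and thermal throttling",
	})
	if err != nil {
		panic(err)
	}
}

func main() {
	m, err := Create("system-cpu-check", "cpu-health")
	if err != nil {
		panic(err)
	}
	fmt.Println("created monitor:", m.Name())
}
```

Keeping the lookup behind a single Create function means the configuration loader only needs the type string from YAML to build a monitor.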
Monitors CPU load average and thermal throttling on the system.
Monitor Type: system-cpu-check
Source File: pkg/monitors/system/cpu.go:584
Configuration:
monitors:
  - name: cpu-health
    type: system-cpu-check
    interval: 30s
    timeout: 5s
    config:
      warningLoadFactor: 0.8 # 80% of CPU cores (auto-validated: 0-100)
      criticalLoadFactor: 1.5 # 150% of CPU cores (auto-validated: 0-100)
      sustainedHighLoadChecks: 3 # Consecutive checks before alerting
      checkThermalThrottle: true
      checkLoadAverage: true

Default Values:
- warningLoadFactor: 0.8 (80%)
- criticalLoadFactor: 1.5 (150%)
- sustainedHighLoadChecks: 3
- checkThermalThrottle: true
- checkLoadAverage: true
Note: Threshold values (warningLoadFactor, criticalLoadFactor) are automatically validated to be in the range 0-100. See Configuration Guide for details.
Key Features:
- Reads /proc/loadavg for 1-minute load average
- Normalizes load by CPU core count
- Sustained high load tracking (prevents alert spam from transient spikes)
- Thermal throttling detection from /sys/devices/system/cpu/cpu*/thermal_throttle/core_throttle_count
- Recovery event generation
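The load normalization step can be illustrated as follows. This is a sketch, not the monitor's actual code: parseLoad1 and classify are hypothetical helpers, and the sample input stands in for the contents of /proc/loadavg.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseLoad1 extracts the 1-minute load average from /proc/loadavg content,
// for example "3.60 2.10 1.05 2/345 6789".
func parseLoad1(content string) (float64, error) {
	fields := strings.Fields(content)
	if len(fields) < 1 {
		return 0, fmt.Errorf("empty loadavg content")
	}
	return strconv.ParseFloat(fields[0], 64)
}

// classify divides the load by the core count and compares the per-core
// load with the warning and critical load factors.
func classify(load1 float64, cores int, warning, critical float64) string {
	if cores < 1 {
		cores = 1 // guard against a zero core count
	}
	perCore := load1 / float64(cores)
	switch {
	case perCore >= critical:
		return "critical"
	case perCore >= warning:
		return "warning"
	default:
		return "ok"
	}
}

func main() {
	load1, err := parseLoad1("3.60 2.10 1.05 2/345 6789\n")
	if err != nil {
		panic(err)
	}
	// 3.60 on 4 cores is a per-core load of 0.9, above the 0.8 warning factor.
	fmt.Println(classify(load1, 4, 0.8, 1.5))
}
```

On a real node the core count would typically come from runtime.NumCPU() and the input from reading /proc/loadavg.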
Events Generated:
- HighCPULoad (Warning/Error): CPU load exceeds thresholds
- CPUThermalThrottle (Warning): Thermal throttling detected
- CPULoadRecovered (Info): CPU load returned to normal
Conditions:
- HighCPULoad: True when sustained high load detected
Example Status:
events:
  - severity: Warning
    reason: HighCPULoad
    message: "CPU load at 120% (1.5 load factor), exceeds 80% warning threshold"
conditions:
  - type: HighCPULoad
    status: "True"
    reason: SustainedHighLoad
    message: "CPU load has been high for 3 consecutive checks"

Monitors memory usage, swap usage, and OOM (Out-Of-Memory) kill events.
Monitor Type: system-memory-check
Source File: pkg/monitors/system/memory.go:653
Configuration:
monitors:
  - name: memory-health
    type: system-memory-check
    interval: 30s
    timeout: 5s
    config:
      warningThreshold: 85 # 85% memory usage (auto-validated: 0-100)
      criticalThreshold: 95 # 95% memory usage (auto-validated: 0-100)
      swapWarningThreshold: 50 # 50% swap usage (auto-validated: 0-100)
      swapCriticalThreshold: 80 # 80% swap usage (auto-validated: 0-100)
      sustainedHighMemoryChecks: 3
      checkOOMKills: true

Default Values:
- warningThreshold: 85%
- criticalThreshold: 95%
- swapWarningThreshold: 50%
- swapCriticalThreshold: 80%
- sustainedHighMemoryChecks: 3
- checkOOMKills: true
Note: Threshold values are automatically validated to be in the range 0-100. See Configuration Guide for details.
Key Features:
- Reads /proc/meminfo for memory statistics
- Monitors /dev/kmsg for OOM kill events
- ARM64 compatibility (handles /dev/kmsg EINVAL gracefully)
- Sustained high memory tracking
- Swap usage monitoring
- OOM kill detection with process details
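A usage percentage can be derived from /proc/meminfo roughly as shown below. The sketch assumes usage is computed as (MemTotal - MemAvailable) / MemTotal, which is a common convention but not confirmed as the monitor's formula; parseMeminfo and usedPercent are hypothetical helpers.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseMeminfo turns "Key:   value kB" lines into a map of kB values.
func parseMeminfo(content string) map[string]uint64 {
	values := make(map[string]uint64)
	for _, line := range strings.Split(content, "\n") {
		fields := strings.Fields(line)
		if len(fields) < 2 {
			continue
		}
		key := strings.TrimSuffix(fields[0], ":")
		if v, err := strconv.ParseUint(fields[1], 10, 64); err == nil {
			values[key] = v
		}
	}
	return values
}

// usedPercent returns the used share of total as a percentage (0 if total is 0).
func usedPercent(total, free uint64) float64 {
	if total == 0 || free > total {
		return 0
	}
	return float64(total-free) / float64(total) * 100
}

func main() {
	// Sample content; a real check would read /proc/meminfo.
	sample := "MemTotal:       16000000 kB\n" +
		"MemAvailable:    2000000 kB\n" +
		"SwapTotal:       4000000 kB\n" +
		"SwapFree:        3000000 kB\n"
	m := parseMeminfo(sample)
	fmt.Printf("memory: %.1f%%\n", usedPercent(m["MemTotal"], m["MemAvailable"]))
	fmt.Printf("swap:   %.1f%%\n", usedPercent(m["SwapTotal"], m["SwapFree"]))
}
```

With the sample values, memory usage lands above the default 85% warning threshold while swap stays below its 50% threshold.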
Events Generated:
- HighMemoryUsage (Warning/Error): Memory usage exceeds thresholds
- HighSwapUsage (Warning/Error): Swap usage exceeds thresholds
- OOMKillDetected (Error): OOM killer has terminated a process
- MemoryRecovered (Info): Memory usage returned to normal
Conditions:
- HighMemoryUsage: True when sustained high memory detected
Special Notes:
- ARM64 /dev/kmsg Issue: On ARM64 systems, /dev/kmsg may return EINVAL errors. The monitor handles this gracefully and continues without OOM kill detection.
Example Status:
events:
  - severity: Error
    reason: OOMKillDetected
    message: "OOM killer terminated process 'node-doctor' (PID 12345)"
  - severity: Warning
    reason: HighMemoryUsage
    message: "Memory usage at 87.5%, exceeds 85% warning threshold"

Monitors disk space, inode usage, read-only filesystems, and I/O health across multiple mount points.
Monitor Type: system-disk-check
Source File: pkg/monitors/system/disk.go:870
Configuration:
monitors:
  - name: disk-health
    type: system-disk-check
    interval: 60s
    timeout: 10s
    config:
      mountPoints:
        - path: /
          warningThreshold: 85
          criticalThreshold: 95
          inodeWarningThreshold: 85
          inodeCriticalThreshold: 95
        - path: /var/lib/kubelet
          warningThreshold: 90
          criticalThreshold: 95
      checkReadOnlyFilesystem: true
      checkIOHealth: true
      ioUtilizationWarning: 80 # 80% I/O utilization
      ioUtilizationCritical: 95 # 95% I/O utilization

Default Values per Mount Point:
- warningThreshold: 85%
- criticalThreshold: 95%
- inodeWarningThreshold: 85%
- inodeCriticalThreshold: 95%
Note: All threshold values (warningThreshold, criticalThreshold, inodeWarningThreshold, inodeCriticalThreshold, ioUtilizationWarning, ioUtilizationCritical) are automatically validated to be in the range 0-100. See Configuration Guide for details.
Key Features:
- Multi-mount point support with individual thresholds
- Space and inode monitoring via syscall.Statfs()
- Read-only filesystem detection (remount attempts)
- I/O health monitoring via /proc/diskstats
- I/O utilization tracking (percent time device is busy)
- Per-device metrics aggregation
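Space and inode usage can both be read from a single syscall.Statfs call. The following is a Linux-only sketch, not the monitor's implementation: usedPercent is a hypothetical helper, and usage is computed from total and free counts without accounting for root-reserved blocks.

```go
package main

import (
	"fmt"
	"syscall"
)

// usedPercent returns the used share of total as a percentage (0 if total is 0).
func usedPercent(total, free uint64) float64 {
	if total == 0 || free > total {
		return 0
	}
	return float64(total-free) / float64(total) * 100
}

func main() {
	var st syscall.Statfs_t
	if err := syscall.Statfs("/", &st); err != nil {
		fmt.Println("statfs failed:", err)
		return
	}
	// Blocks/Bfree describe space, Files/Ffree describe inodes.
	space := usedPercent(uint64(st.Blocks), uint64(st.Bfree))
	inodes := usedPercent(uint64(st.Files), uint64(st.Ffree))
	fmt.Printf("/: space %.1f%% used, inodes %.1f%% used\n", space, inodes)
}
```

Some filesystems report zero total inodes; the helper returns 0 in that case rather than dividing by zero.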
Events Generated:
- HighDiskUsage (Warning/Error): Disk space exceeds thresholds
- HighInodeUsage (Warning/Error): Inode usage exceeds thresholds
- ReadOnlyFilesystem (Error): Filesystem is mounted read-only
- HighIOUtilization (Warning/Error): I/O utilization exceeds thresholds
- DiskRecovered (Info): Disk usage returned to normal
Conditions:
- HighDiskUsage: True when disk space critical
- ReadOnlyFilesystem: True when filesystem is read-only
Example Status:
events:
  - severity: Error
    reason: HighDiskUsage
    message: "Disk usage at /: 96.2% exceeds 95% critical threshold"
  - severity: Warning
    reason: HighInodeUsage
    message: "Inode usage at /var/lib/kubelet: 87.3% exceeds 85% warning threshold"

Comprehensive DNS resolution monitoring for cluster-internal, external, and custom domains with advanced intermittent failure detection capabilities.
Monitor Type: network-dns-check
Source File: pkg/monitors/network/dns.go
monitors:
  - name: dns-health
    type: network-dns-check
    interval: 30s
    timeout: 10s
    config:
      clusterDomains:
        - kubernetes.default.svc.cluster.local
        - kube-dns.kube-system.svc.cluster.local
      externalDomains:
        - google.com
        - cloudflare.com
      customQueries:
        - domain: api.example.com
          recordType: A
        - domain: ldap.internal.corp
          recordType: A
          testEachNameserver: true # Test against each DNS server
        - domain: critical-service.internal
          recordType: A
          consistencyCheck: true # Enable rapid consistency checks
      latencyThreshold: 1s
      nameserverCheckEnabled: true
      failureCountThreshold: 3
      resolverPath: /etc/resolv.conf

Default Values:
- clusterDomains: ["kubernetes.default.svc.cluster.local"]
- externalDomains: ["google.com", "cloudflare.com"]
- latencyThreshold: 1 second
- nameserverCheckEnabled: false
- failureCountThreshold: 3
- resolverPath: /etc/resolv.conf
Test a specific domain against each configured nameserver individually to identify which DNS server is causing issues.
config:
  customQueries:
    - domain: "critical-service.internal.corp"
      recordType: "A"
      testEachNameserver: true # Test against each nameserver in /etc/resolv.conf

Use Cases:
- Identify which upstream DNS server is failing
- Detect intermittent issues caused by DNS server round-robin
- Troubleshoot split-horizon DNS problems
Generated Conditions:
- CustomDNSDown: All nameservers failed to resolve the domain
- DNSResolutionDegraded: Some nameservers working, others failing
- CustomDNSHealthy: All nameservers responding successfully
Track DNS resolution success rate over a sliding window to detect intermittent failures that consecutive-failure tracking misses.
config:
  successRateTracking:
    enabled: true
    windowSize: 10 # Track last 10 checks
    failureRateThreshold: 30 # Alert if >30% failures (accepts 0-100 or 0.0-1.0)
    minSamplesRequired: 5 # Need at least 5 samples before alerting

Default Values:
- enabled: false (disabled by default)
- windowSize: 10
- failureRateThreshold: 0.3 (30%)
- minSamplesRequired: 5
Problem Solved:
Traditional consecutive failure tracking misses intermittent issues:
Check 1: SUCCESS
Check 2: FAIL
Check 3: SUCCESS
Check 4: FAIL
Check 5: FAIL
Check 6: SUCCESS
Check 7: FAIL
With failureCountThreshold: 3, no alert fires because consecutive failures never reach 3. But 57% of checks are failing!
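The difference between the two tracking methods can be reproduced with the seven checks listed above. This is a minimal sketch with hypothetical names (window, add, failureRate, maxConsecutiveFailures); it is not the monitor's implementation and uses the fractional form of the threshold.

```go
package main

import "fmt"

// window keeps the most recent check results, oldest first.
type window struct {
	size    int
	results []bool
}

// add appends a result and drops the oldest one once the window is full.
func (w *window) add(ok bool) {
	w.results = append(w.results, ok)
	if len(w.results) > w.size {
		w.results = w.results[1:]
	}
}

// failureRate returns the fraction of failed checks in the window.
func (w *window) failureRate() float64 {
	if len(w.results) == 0 {
		return 0
	}
	failures := 0
	for _, ok := range w.results {
		if !ok {
			failures++
		}
	}
	return float64(failures) / float64(len(w.results))
}

// maxConsecutiveFailures returns the longest run of failures.
func maxConsecutiveFailures(results []bool) int {
	longest, current := 0, 0
	for _, ok := range results {
		if ok {
			current = 0
			continue
		}
		current++
		if current > longest {
			longest = current
		}
	}
	return longest
}

func main() {
	// SUCCESS, FAIL, SUCCESS, FAIL, FAIL, SUCCESS, FAIL
	checks := []bool{true, false, true, false, false, true, false}
	w := &window{size: 10}
	for _, ok := range checks {
		w.add(ok)
	}
	const failureRateThreshold, minSamples = 0.3, 5
	degraded := len(w.results) >= minSamples && w.failureRate() > failureRateThreshold
	fmt.Printf("longest failure run: %d\n", maxConsecutiveFailures(checks))
	fmt.Printf("failure rate: %.0f%%, degraded: %v\n", w.failureRate()*100, degraded)
}
```

The longest failure run is 2, so a consecutive threshold of 3 never fires, while the window reports 4 of 7 checks failed and crosses the 30% threshold.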
Generated Conditions:
- ClusterDNSDegraded: Cluster DNS failure rate exceeds threshold
- ClusterDNSIntermittent: Some cluster DNS failures (below threshold)
- ExternalDNSDegraded: External DNS failure rate exceeds threshold
- ExternalDNSIntermittent: Some external DNS failures (below threshold)
Automatically classifies DNS errors to help identify root causes faster.
Error Types:
| Type | Indicates | Go Error Patterns |
|---|---|---|
| Timeout | Network issues, server overload, firewall | i/o timeout, context deadline exceeded |
| NXDOMAIN | Domain doesn't exist, typo, missing record | no such host, DNSError.IsNotFound |
| SERVFAIL | Upstream DNS error, DNSSEC validation failure | server misbehaving |
| Refused | DNS server down, wrong port, firewall | connection refused |
| Temporary | Transient failure, retry may succeed | DNSError.IsTemporary |
| Unknown | Unclassified error | Other errors |
Example Event:
events:
  - severity: Error
    reason: CustomDNSQueryFailed
    message: "[NXDOMAIN] Failed to resolve custom domain ldap.corp.internal: no such host"

Perform multiple rapid DNS queries to detect intermittent DNS issues that single queries miss.
config:
  consistencyChecking:
    enabled: true
    queriesPerCheck: 5 # Make 5 rapid queries (range: 2-20)
    intervalBetweenQueries: 200ms # Delay between queries (range: 10ms-5s)
  customQueries:
    - domain: "critical-service.internal.corp"
      consistencyCheck: true # Enable for this specific query

Default Values:
- enabled: false (disabled by default)
- queriesPerCheck: 5
- intervalBetweenQueries: 200ms
Detection Capabilities:
- All queries succeed with same IPs → DNSResolutionConsistent
- All queries fail → DNSResolutionDown
- Some succeed, some fail → DNSResolutionIntermittent
- All succeed but different IPs → DNSResolutionInconsistent
Example Status:
events:
  - severity: Warning
    reason: ConsistencyCheckIntermittent
    message: "[Timeout] Intermittent DNS resolution for critical-service.internal.corp: 3/5 queries succeeded (avg latency: 150ms)"
conditions:
  - type: DNSResolutionIntermittent
    status: "True"
    reason: IntermittentConsistencyFailures
    message: "Domain critical-service.internal.corp: 3/5 queries succeeded (60.0% success rate)"

monitors:
  - name: dns-health
    type: network-dns-check
    interval: 30s
    timeout: 10s
    config:
      # Basic domain testing
      clusterDomains:
        - kubernetes.default.svc.cluster.local
      externalDomains:
        - google.com
        - cloudflare.com
      # Custom domain queries with per-query options
      customQueries:
        - domain: "api.example.com"
          recordType: "A" # Currently only "A" supported
        - domain: "ldap.internal.corp"
          recordType: "A"
          testEachNameserver: true # Test each nameserver individually
        - domain: "critical-service.internal"
          recordType: "A"
          consistencyCheck: true # Enable consistency checking
      # Thresholds
      latencyThreshold: 1s # Max acceptable query latency
      failureCountThreshold: 3 # Consecutive failures before alert
      # Nameserver configuration
      nameserverCheckEnabled: true # Check nameserver reachability
      resolverPath: /etc/resolv.conf # Path to resolver config
      # Success rate tracking (sliding window)
      successRateTracking:
        enabled: true
        windowSize: 10
        failureRateThreshold: 30 # 30% (accepts 0-100 or 0.0-1.0)
        minSamplesRequired: 5
      # Consistency checking (rapid multi-query)
      consistencyChecking:
        enabled: true
        queriesPerCheck: 5 # 2-20 queries per check
        intervalBetweenQueries: 200ms # 10ms-5s between queries

| Event | Severity | Description |
|---|---|---|
| ClusterDNSResolutionFailed | Error | Cluster DNS query failed |
| ExternalDNSResolutionFailed | Error | External DNS query failed |
| CustomDNSQueryFailed | Error | Custom domain query failed |
| ClusterDNSNoRecords | Warning | No A records for cluster domain |
| ExternalDNSNoRecords | Warning | No A records for external domain |
| CustomDNSNoRecords | Warning | No A records for custom domain |
| HighClusterDNSLatency | Warning | Cluster DNS latency exceeds threshold |
| HighExternalDNSLatency | Warning | External DNS latency exceeds threshold |
| HighCustomDNSLatency | Warning | Custom DNS latency exceeds threshold |
| NameserverUnreachable | Warning | Nameserver failed to respond |
| NameserverDomainResolutionFailed | Warning | Specific nameserver failed for domain |
| NameserverDomainLatencyHigh | Warning | Nameserver latency exceeds threshold |
| ConsistencyCheckIntermittent | Warning | Some consistency check queries failed |
| ConsistencyCheckAllFailed | Error | All consistency check queries failed |
| ConsistencyCheckIPVariation | Warning | Varying IPs across queries |
| ConsistencyCheckHighLatency | Warning | Average latency exceeds threshold |
| ConsistencyCheckCancelled | Warning | Check cancelled (context timeout) |
| UnsupportedQueryType | Warning | Record type other than A requested |
| ResolverConfigParseError | Warning | Failed to parse /etc/resolv.conf |
| NoNameserversConfigured | Warning | No nameservers found in config |
Node conditions are prefixed with NodeDoctor when exported to Kubernetes:
| Condition (as shown in kubectl) | Description |
|---|---|
| NodeDoctorClusterDNSDown | Cluster DNS failed repeatedly |
| NodeDoctorClusterDNSHealthy | Cluster DNS resolution is healthy |
| NodeDoctorNetworkUnreachable | External DNS failed repeatedly |
| NodeDoctorNetworkReachable | External DNS resolution is healthy |
| NodeDoctorClusterDNSDegraded | Cluster DNS failure rate exceeds threshold |
| NodeDoctorClusterDNSIntermittent | Cluster DNS has intermittent failures |
| NodeDoctorExternalDNSDegraded | External DNS failure rate exceeds threshold |
| NodeDoctorExternalDNSIntermittent | External DNS has intermittent failures |
| NodeDoctorCustomDNSDown | All nameservers failed for custom domain |
| NodeDoctorCustomDNSHealthy | All nameservers healthy for custom domain |
| NodeDoctorDNSResolutionDegraded | Partial nameserver failure for domain |
| NodeDoctorDNSResolutionConsistent | Consistency check: all queries consistent |
| NodeDoctorDNSResolutionIntermittent | Consistency check: some queries failed |
| NodeDoctorDNSResolutionInconsistent | Consistency check: varying IP addresses (expected for CDN/load-balanced domains) |
| NodeDoctorDNSResolutionDown | Consistency check: all queries failed |
Querying DNS Conditions:
# View all DNS conditions for a node
kubectl describe node <node-name> | grep -E "NodeDoctorDNS|NodeDoctorCluster|NodeDoctorCustomDNS"
# Example output:
# NodeDoctorDNSResolutionConsistent True ConsistentResolution Domain google.com: all 5 queries succeeded with consistent IPs
# NodeDoctorCustomDNSHealthy True AllNameserversHealthy Domain cloudflare.com: all 1 nameservers responding
# NodeDoctorClusterDNSHealthy True ClusterDNSResolved Cluster DNS resolution is healthy

monitors:
  - name: ldap-dns-critical
    type: network-dns-check
    interval: 30s
    timeout: 10s
    config:
      clusterDomains: [] # Disable cluster DNS checks
      externalDomains: [] # Disable external DNS checks
      customQueries:
        - domain: "ldap.corp.internal"
          recordType: "A"
          testEachNameserver: true # Identify failing DNS servers
          consistencyCheck: true # Detect intermittent failures
        - domain: "auth.corp.internal"
          recordType: "A"
          testEachNameserver: true
      latencyThreshold: 500ms # Stricter latency for auth services
      nameserverCheckEnabled: true
      failureCountThreshold: 2 # Alert faster for critical domains
      successRateTracking:
        enabled: true
        windowSize: 20 # Track more samples
        failureRateThreshold: 10 # 10% failure rate threshold
        minSamplesRequired: 5
      consistencyChecking:
        enabled: true
        queriesPerCheck: 10 # More queries for higher confidence
        intervalBetweenQueries: 100ms # Faster queries

events:
  - severity: Warning
    reason: NameserverDomainResolutionFailed
    message: "[Timeout] Nameserver 10.154.57.53 failed to resolve ldap.corp.internal: i/o timeout"
  - severity: Warning
    reason: ConsistencyCheckIntermittent
    message: "[Timeout] Intermittent DNS resolution for ldap.corp.internal: 8/10 queries succeeded (avg latency: 120ms)"
conditions:
  - type: DNSResolutionDegraded
    status: "True"
    reason: PartialNameserverFailure
    message: "Domain ldap.corp.internal: 2/3 nameservers responding (failed: 10.154.57.53)"
  - type: DNSResolutionIntermittent
    status: "True"
    reason: IntermittentConsistencyFailures
    message: "Domain ldap.corp.internal: 8/10 queries succeeded (80.0% success rate)"

Before configuring DNS monitoring, identify domains critical to your infrastructure:
| Dependency Type | Examples | Risk Level |
|---|---|---|
| Authentication | LDAP/AD servers, OAuth providers | Critical - auth failures block users |
| Databases | PostgreSQL, MySQL, MongoDB hostnames | Critical - app failures |
| Service Mesh | Consul, Istio service discovery | High - service routing failures |
| External APIs | Payment gateways, third-party services | High - feature degradation |
| Container Registries | gcr.io, docker.io, custom registries | Medium - deployment failures |
| Cluster Services | kubernetes.default.svc.cluster.local | Critical - pod communication |

| Issue | Symptoms | Detection Method |
|---|---|---|
| Upstream DNS overload | Intermittent timeouts in large clusters (100+ nodes) | Success rate tracking, consistency checking |
| Custom TLD misconfiguration | NXDOMAIN for .local, .internal, .test domains | Error type classification shows NXDOMAIN |
| Split-horizon DNS | Different results from different nameservers | Per-nameserver testing |
| CoreDNS pod failures | Cluster DNS fails, external works | Compare clusterDomains vs externalDomains results |
| DNS cache TTL issues | Stale IPs after service migration | Consistency checking shows IP variation |
| Network policy blocking | Timeout to specific nameservers | Per-nameserver testing with error classification |
1. Basic External Connectivity Check:
monitors:
  - name: external-dns
    type: network-dns-check
    interval: 60s
    config:
      clusterDomains: [] # Skip cluster DNS
      externalDomains:
        - google.com
        - cloudflare.com
      latencyThreshold: 2s
      failureCountThreshold: 3

2. Custom TLD Monitoring (.internal, .local):
monitors:
  - name: internal-dns
    type: network-dns-check
    interval: 30s
    config:
      clusterDomains: []
      externalDomains: []
      customQueries:
        - domain: "app.internal.corp"
          recordType: "A"
          testEachNameserver: true # Find which DNS server fails
        - domain: "db.internal.corp"
          recordType: "A"
          testEachNameserver: true
      nameserverCheckEnabled: true
      failureCountThreshold: 2

3. High-Availability DNS Validation:
monitors:
  - name: ha-dns-check
    type: network-dns-check
    interval: 15s # More frequent checks
    config:
      customQueries:
        - domain: "api-gateway.prod.svc.cluster.local"
          consistencyCheck: true # Detect intermittent failures
          testEachNameserver: true # Check all DNS servers
      successRateTracking:
        enabled: true
        windowSize: 20
        failureRateThreshold: 5 # 5% threshold for critical services
        minSamplesRequired: 10
      consistencyChecking:
        enabled: true
        queriesPerCheck: 10
        intervalBetweenQueries: 50ms # Aggressive testing

4. Database Hostname Monitoring:
monitors:
  - name: database-dns
    type: network-dns-check
    interval: 30s
    config:
      customQueries:
        - domain: "postgres-primary.db.svc.cluster.local"
          recordType: "A"
          consistencyCheck: true
        - domain: "postgres-replica.db.svc.cluster.local"
          recordType: "A"
          consistencyCheck: true
        - domain: "redis-master.cache.svc.cluster.local"
          recordType: "A"
      latencyThreshold: 100ms # Low latency for DB connections
      failureCountThreshold: 1 # Alert immediately
      consistencyChecking:
        enabled: true
        queriesPerCheck: 5

When DNS issues are detected, consider these remediation steps:
| Condition | Cause | Remediation |
|---|---|---|
| ClusterDNSDown | CoreDNS pods unhealthy | Check kubectl -n kube-system get pods -l k8s-app=kube-dns |
| DNSResolutionDegraded (partial nameserver failure) | One upstream DNS server failing | Update /etc/resolv.conf or CoreDNS upstream servers |
| DNSResolutionIntermittent | Overloaded DNS servers | Increase CoreDNS replicas, enable DNS caching |
| NXDOMAIN errors | Missing DNS record or zone | Add record to DNS zone, check CoreDNS stub domains |
| High latency | Network congestion, distant DNS | Use local caching DNS, reduce TTL for faster updates |
| DNSResolutionInconsistent (varying IPs) | DNS load balancing, stale cache | Verify expected behavior, check TTL settings |
CoreDNS Stub Domain Configuration:
If custom TLDs (.internal, .corp) are failing, configure CoreDNS stub domains:
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        prometheus :9153
        forward . /etc/resolv.conf
        cache 30
        loop
        reload
        loadbalance
    }
    corp.internal:53 {
        errors
        cache 30
        forward . 10.0.0.53 10.0.0.54 # Internal DNS servers
    }

Temporary /etc/hosts Workaround:
For immediate mitigation while DNS is fixed:
apiVersion: v1
kind: Pod
spec:
  hostAliases:
    - ip: "10.0.1.100"
      hostnames:
        - "ldap.corp.internal"
    - ip: "10.0.1.101"
      hostnames:
        - "auth.corp.internal"

DNS conditions appear as node conditions viewable via kubectl:
# View all DNS-related node conditions
kubectl describe node <node-name> | grep -E "NodeDoctorDNS|NodeDoctorCluster|NodeDoctorCustomDNS"
# Example output from a healthy node:
# NodeDoctorDNSResolutionConsistent True ConsistentResolution Domain google.com: all 5 queries succeeded with consistent IPs 142.251.32.46 (avg latency: 4.65ms)
# NodeDoctorCustomDNSHealthy True AllNameserversHealthy Domain cloudflare.com: all 1 nameservers responding
# NodeDoctorClusterDNSHealthy True ClusterDNSResolved Cluster DNS resolution is healthy
# NodeDoctorDNSResolutionInconsistent True InconsistentIPAddresses Domain google.com: 3 unique IPs returned across 5 queries
# Check DNS conditions across all nodes
kubectl get nodes -o name | xargs -I {} sh -c 'echo "=== {} ===" && kubectl describe {} | grep -E "NodeDoctorDNS|NodeDoctorCluster|NodeDoctorCustomDNS"'

Alerting with Prometheus:
Node Doctor exports DNS metrics that can be used for alerting:
# Example Prometheus alert rules (using kube-state-metrics node conditions)
groups:
  - name: dns-alerts
    rules:
      - alert: DNSResolutionIntermittent
        expr: |
          kube_node_status_condition{condition="NodeDoctorDNSResolutionIntermittent", status="true"} == 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Intermittent DNS resolution on {{ $labels.node }}"
          description: "DNS resolution is intermittent, indicating upstream DNS issues"
      - alert: DNSResolutionDegraded
        expr: |
          kube_node_status_condition{condition="NodeDoctorDNSResolutionDegraded", status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Degraded DNS resolution on {{ $labels.node }}"
          description: "One or more DNS servers are failing for critical domains"
      - alert: ClusterDNSDown
        expr: |
          kube_node_status_condition{condition="NodeDoctorClusterDNSDown", status="true"} == 1
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Cluster DNS is down on {{ $labels.node }}"
          description: "Cluster DNS resolution has repeatedly failed"
      - alert: CustomDNSUnhealthy
        expr: |
          kube_node_status_condition{condition="NodeDoctorCustomDNSDown", status="true"} == 1
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "Custom DNS domains failing on {{ $labels.node }}"
          description: "All nameservers are failing for custom domain queries"

Complete Testing Configuration:
This configuration enables all DNS monitoring features for comprehensive testing:
monitors:
  - name: dns-health
    type: network-dns-check
    enabled: true
    interval: 15s # More frequent for testing
    timeout: 10s
    config:
      # Standard cluster DNS domains
      clusterDomains:
        - kubernetes.default.svc.cluster.local
        - kube-dns.kube-system.svc.cluster.local
      # External DNS domains
      externalDomains:
        - google.com
        - cloudflare.com
      # Custom queries with all features enabled
      customQueries:
        # Test per-nameserver checking
        - domain: "kubernetes.default.svc.cluster.local"
          recordType: "A"
          testEachNameserver: true
        # Test consistency checking
        - domain: "google.com"
          recordType: "A"
          consistencyCheck: true
        # Test both features together
        - domain: "cloudflare.com"
          recordType: "A"
          testEachNameserver: true
          consistencyCheck: true
      latencyThreshold: 1s
      nameserverCheckEnabled: true
      failureCountThreshold: 2
      # Success rate tracking - sliding window
      successRateTracking:
        enabled: true
        windowSize: 10
        failureRateThreshold: 20 # 20% failure rate threshold
        minSamplesRequired: 3
      # Consistency checking - rapid multi-query
      consistencyChecking:
        enabled: true
        queriesPerCheck: 5
        intervalBetweenQueries: 100ms

Expected Conditions When Testing:
| Condition | Expected Value | Notes |
|---|---|---|
| NodeDoctorClusterDNSHealthy | True | Cluster DNS should resolve |
| NodeDoctorCustomDNSHealthy | True | Per-nameserver tests passing |
| NodeDoctorDNSResolutionConsistent | True | Queries returning consistent results |
| NodeDoctorDNSResolutionInconsistent | True (for google.com) | Expected - Google uses DNS round-robin, multiple IPs is normal |
Note: NodeDoctorDNSResolutionInconsistent=True for domains like google.com is expected behavior. Large CDN/cloud providers use DNS load balancing, which returns different IPs. This condition helps detect unexpected IP variation in domains where you expect a single IP.
SLO/SLA Tracking:
Use success rate tracking for DNS SLOs:
# Configuration for 99.9% DNS availability SLO
successRateTracking:
  enabled: true
  windowSize: 100 # Track 100 checks
  failureRateThreshold: 0.001 # 0.1% as a fraction = 99.9% availability (values in 0.0-1.0 are read as fractions, so 0.1 would mean 10%)
  minSamplesRequired: 50 # Need 50 samples before alerting

Monitors default gateway reachability via ICMP ping.
Monitor Type: network-gateway-check
Source File: pkg/monitors/network/gateway.go:300
Configuration:
monitors:
  - name: gateway-health
    type: network-gateway-check
    interval: 30s
    timeout: 5s
    config:
      autoDetectGateway: true # Auto-detect from /proc/net/route
      manualGateway: "" # Override with specific IP
      pingCount: 3
      pingTimeout: 1s
      latencyThreshold: 100ms
      failureCountThreshold: 3

Default Values:
- autoDetectGateway: true
- pingCount: 3
- pingTimeout: 1 second
- latencyThreshold: 100ms
- failureCountThreshold: 3
Key Features:
- Auto-detection of default gateway from /proc/net/route
- ICMP echo request/reply (ping)
- Average latency calculation
- Packet loss detection
- Manual gateway override option
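Gateway auto-detection from /proc/net/route amounts to finding the row whose destination is 00000000 and decoding its hex gateway field. The sketch below is illustrative: defaultGateway is a hypothetical helper, the sample table stands in for the real file, and the byte order assumes a little-endian host.

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"strconv"
	"strings"
)

// defaultGateway returns the gateway of the default route (destination
// 00000000 with the RTF_GATEWAY flag set) from /proc/net/route content.
func defaultGateway(routeTable string) (net.IP, error) {
	lines := strings.Split(strings.TrimSpace(routeTable), "\n")
	for _, line := range lines[1:] { // skip the header row
		fields := strings.Fields(line)
		if len(fields) < 4 || fields[1] != "00000000" {
			continue
		}
		flags, err := strconv.ParseUint(fields[3], 16, 16)
		if err != nil || flags&0x2 == 0 { // 0x2 is RTF_GATEWAY
			continue
		}
		raw, err := strconv.ParseUint(fields[2], 16, 32)
		if err != nil {
			continue
		}
		// The kernel prints the address as a host-order hex word, so on a
		// little-endian host the lowest byte is the first octet.
		return net.IPv4(byte(raw), byte(raw>>8), byte(raw>>16), byte(raw>>24)), nil
	}
	return nil, errors.New("no default route found")
}

func main() {
	sample := "Iface\tDestination\tGateway\tFlags\tRefCnt\tUse\tMetric\tMask\tMTU\tWindow\tIRTT\n" +
		"eth0\t00000000\t0101A8C0\t0003\t0\t0\t100\t00000000\t0\t0\t0\n" +
		"eth0\t0001A8C0\t00000000\t0001\t0\t0\t100\t00FFFFFF\t0\t0\t0\n"
	gw, err := defaultGateway(sample)
	if err != nil {
		panic(err)
	}
	fmt.Println("default gateway:", gw)
}
```

The sample row decodes to 192.168.1.1, the address used in the example status for this monitor.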
Events Generated:
- GatewayUnreachable (Error): Cannot reach default gateway
- GatewayHighLatency (Warning): Gateway latency exceeds threshold
- GatewayRecovered (Info): Gateway reachability restored
Conditions:
- GatewayUnreachable: True when failure threshold exceeded
Example Status:
events:
  - severity: Error
    reason: GatewayUnreachable
    message: "Default gateway 192.168.1.1 is unreachable (0/3 packets received)"
  - severity: Warning
    reason: GatewayHighLatency
    message: "Gateway latency 150ms exceeds 100ms threshold"

Monitors HTTP/HTTPS endpoint connectivity for external services and APIs.
Monitor Type: network-connectivity-check
Source File: pkg/monitors/network/connectivity.go:300
Configuration:
monitors:
  - name: connectivity-health
    type: network-connectivity-check
    interval: 60s
    timeout: 30s
    config:
      endpoints:
        - url: https://kubernetes.default.svc.cluster.local
          name: kubernetes-api
          method: GET
          expectedStatusCode: 200
          timeout: 10s
          followRedirects: false
          headers:
            Authorization: "Bearer ${TOKEN}"
        - url: https://registry.example.com/v2/
          name: container-registry
          method: HEAD
          expectedStatusCode: 200

Default Values per Endpoint:
- method: HEAD (safe, no response body)
- expectedStatusCode: 200
- timeout: 10 seconds
- followRedirects: false
Key Features:
- Multiple endpoint monitoring
- HTTP method support: GET, HEAD, OPTIONS (POST/PUT/DELETE restricted for safety)
- Custom headers (e.g., authentication tokens)
- Expected status code validation
- URL and protocol validation (http/https only)
- Resource limits: maximum 50 endpoints
Events Generated:
- EndpointUnreachable (Error): HTTP request failed
- EndpointUnexpectedStatus (Warning): Status code mismatch
- EndpointRecovered (Info): Endpoint reachable again
Security Considerations:
- Only safe HTTP methods allowed (GET, HEAD, OPTIONS)
- No POST/PUT/DELETE to prevent accidental data modification
- URL validation prevents file:// and other protocols
- Header sanitization in error messages
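The method, protocol, and endpoint-count restrictions could be enforced by a validator along these lines. This is a sketch rather than the monitor's real validation code: endpoint and validateEndpoints are hypothetical names, and defaulting an empty method to HEAD follows the defaults listed above.

```go
package main

import (
	"fmt"
	"net/url"
	"strings"
)

const maxEndpoints = 50

// Only methods that do not modify data are accepted.
var allowedMethods = map[string]bool{"GET": true, "HEAD": true, "OPTIONS": true}

// endpoint is a simplified stand-in for one entry under config.endpoints.
type endpoint struct {
	URL    string
	Method string
}

// validateEndpoints checks the endpoint count, HTTP method, and URL scheme.
func validateEndpoints(eps []endpoint) error {
	if len(eps) > maxEndpoints {
		return fmt.Errorf("too many endpoints: %d (maximum %d)", len(eps), maxEndpoints)
	}
	for _, ep := range eps {
		method := strings.ToUpper(ep.Method)
		if method == "" {
			method = "HEAD"
		}
		if !allowedMethods[method] {
			return fmt.Errorf("method %q not allowed for %s", method, ep.URL)
		}
		u, err := url.Parse(ep.URL)
		if err != nil {
			return fmt.Errorf("invalid URL %q: %w", ep.URL, err)
		}
		if u.Scheme != "http" && u.Scheme != "https" {
			return fmt.Errorf("unsupported scheme %q in %s", u.Scheme, ep.URL)
		}
		if u.Host == "" {
			return fmt.Errorf("missing host in %q", ep.URL)
		}
	}
	return nil
}

func main() {
	fmt.Println(validateEndpoints([]endpoint{{URL: "https://registry.example.com/v2/", Method: "HEAD"}}))
	fmt.Println(validateEndpoints([]endpoint{{URL: "file:///etc/passwd", Method: "GET"}}))
	fmt.Println(validateEndpoints([]endpoint{{URL: "https://api.example.com", Method: "POST"}}))
}
```

Rejecting bad entries at configuration time keeps the check loop itself free of safety logic.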
Example Status:
events:
  - severity: Error
    reason: EndpointUnreachable
    message: "Failed to reach https://registry.example.com: connection timeout"
  - severity: Warning
    reason: EndpointUnexpectedStatus
    message: "Endpoint kubernetes-api returned 401, expected 200"

Monitors CNI (Container Network Interface) health and cross-node network connectivity. This monitor is critical for detecting network partitions and ensuring nodes can communicate with each other in the cluster.
Monitor Type: network-cni-check
Source Files:
- pkg/monitors/network/cni.go
- pkg/monitors/network/cni_health.go
- pkg/monitors/network/peer_discovery.go
Configuration:
monitors:
  - name: cni-health
    type: network-cni-check
    interval: 30s
    timeout: 15s
    config:
      discovery:
        method: kubernetes # kubernetes or static
        namespace: node-doctor # Namespace for peer discovery
        labelSelector: app=node-doctor
        refreshInterval: 5m # How often to refresh peer list
        staticPeers: [] # For static method: list of IPs
      connectivity:
        pingCount: 3 # Pings per peer
        pingTimeout: 5s # Timeout per ping
        warningLatency: 50ms # Latency warning threshold
        criticalLatency: 200ms # Latency critical threshold
        failureThreshold: 3 # Consecutive failures before alert
        minReachablePeers: 80 # Minimum % of peers that must be reachable
      cniHealth:
        enabled: true # Enable CNI config/interface checks
        configPath: /etc/cni/net.d # CNI configuration directory
        checkInterfaces: false # Check for specific interfaces
        expectedInterfaces: [] # Expected interface names

Default Values:
Discovery:
- method: kubernetes
- namespace: node-doctor
- labelSelector: app=node-doctor
- refreshInterval: 5 minutes
Connectivity:
- pingCount: 3
- pingTimeout: 5 seconds
- warningLatency: 50ms
- criticalLatency: 200ms
- failureThreshold: 3
- minReachablePeers: 80%
CNI Health:
- enabled: true
- configPath: /etc/cni/net.d
- checkInterfaces: false
Key Features:
- Peer Discovery
  - Kubernetes API-based discovery of other node-doctor instances
  - Automatic exclusion of self (current node)
  - Background refresh of peer list
  - Static peer configuration for non-Kubernetes environments
- Cross-Node Connectivity
  - ICMP ping mesh to all discovered peers
  - Latency measurement with configurable thresholds
  - Network partition detection when peer reachability drops below threshold
  - Per-peer failure tracking with consecutive failure counts
- CNI Health Validation
  - CNI configuration file detection and validation
  - Support for multiple CNI plugins (Calico, Flannel, Weave, Cilium, etc.)
  - Network interface discovery and health checking
  - Expected interface validation
- Network Partition Detection
  - Reports NetworkPartitioned condition when insufficient peers are reachable
  - Configurable minimum reachable percentage threshold
  - Automatic recovery detection
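The partition check reduces to comparing the share of reachable peers against minReachablePeers. The following is a minimal sketch of that comparison, not the monitor's actual code; PeerResult, reachablePercent, and isPartitioned are hypothetical names. It assumes the threshold is inclusive, which matches the example status where 80% reachability against an 80% threshold is not a partition.

```go
package main

import "fmt"

// PeerResult is a hypothetical per-peer probe outcome (not a node-doctor type).
type PeerResult struct {
	Name      string
	Reachable bool
}

// reachablePercent returns the percentage of reachable peers.
// The second value is false when there are no peers at all, so the caller
// can report NoPeersFound instead of a partition.
func reachablePercent(peers []PeerResult) (float64, bool) {
	if len(peers) == 0 {
		return 0, false
	}
	reachable := 0
	for _, p := range peers {
		if p.Reachable {
			reachable++
		}
	}
	return float64(reachable*100) / float64(len(peers)), true
}

// isPartitioned reports whether reachability fell below minReachablePeers.
// Assumption: reaching exactly the threshold counts as healthy.
func isPartitioned(peers []PeerResult, minReachablePeers float64) bool {
	pct, ok := reachablePercent(peers)
	return ok && pct < minReachablePeers
}

func main() {
	peers := []PeerResult{
		{"node-1", true}, {"node-2", true}, {"node-3", true},
		{"node-4", true}, {"node-5", false},
	}
	pct, _ := reachablePercent(peers)
	fmt.Printf("%.0f%% reachable, partitioned=%v\n", pct, isPartitioned(peers, 80))
}
```

Treating an empty peer list separately keeps a discovery failure from being reported as a network partition.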
Discovery Methods:
- Kubernetes Discovery (default)
  - Uses Kubernetes API to list pods matching label selector
  - Requires RBAC permissions for pod list/watch
  - Automatically discovers node-doctor instances on other nodes
  - Uses host network IPs for connectivity testing
- Static Discovery
  - Manual list of peer IP addresses
  - Useful for testing or non-Kubernetes environments
  - No Kubernetes API dependency
CNI Plugin Detection:
The monitor automatically detects common CNI plugins:
- Calico: calico-* interfaces, 10-calico.conflist
- Flannel: flannel.1 interface, 10-flannel.conflist
- Weave: weave* interfaces
- Cilium: cilium_host, cilium_net, lxc* interfaces
- Canal: Combined Calico/Flannel
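Interface-based detection can be approximated with prefix matching on interface names. This is an illustrative sketch only: the prefix table is taken from the list above, detectPlugins is a hypothetical helper, and the real monitor may also inspect the .conflist files.

```go
package main

import (
	"fmt"
	"strings"
)

// pluginPrefixes maps interface-name prefixes (from the list above) to plugins.
var pluginPrefixes = []struct{ prefix, plugin string }{
	{"calico-", "Calico"},
	{"flannel.", "Flannel"},
	{"weave", "Weave"},
	{"cilium_", "Cilium"},
	{"lxc", "Cilium"},
}

// detectPlugins returns each plugin whose prefix matches at least one
// interface name, in order of first appearance and without duplicates.
func detectPlugins(ifaces []string) []string {
	seen := map[string]bool{}
	var found []string
	for _, name := range ifaces {
		for _, p := range pluginPrefixes {
			if strings.HasPrefix(name, p.prefix) && !seen[p.plugin] {
				seen[p.plugin] = true
				found = append(found, p.plugin)
			}
		}
	}
	return found
}

func main() {
	// In a real check the names would come from net.Interfaces().
	fmt.Println(detectPlugins([]string{"lo", "eth0", "cilium_host", "cilium_net", "lxc12ab"}))
}
```

A node that reports both Calico and Flannel under this scheme would correspond to the Canal case.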
Events Generated:
- NoPeersFound (Warning): No peer instances discovered
- PeerUnreachable (Error): Peer node unreachable after failure threshold
- HighPeerLatency (Warning): Latency to peer exceeds threshold
- CNIConnectivitySummary (Info): Summary of peer connectivity
- CNIConfigError (Error): CNI configuration validation failed
- CNIInterfaceWarning (Warning): Expected interfaces not found
Conditions:
- NetworkPartitioned: True when insufficient peers reachable
- NetworkDegraded: True when high latency detected to peers
- CNIHealthy: Overall CNI plugin health status
- CNIConfigValid: CNI configuration file validity
- CNIInterfacesHealthy: Expected network interfaces present
Example Status:
events:
  - severity: Info
    reason: CNIConnectivitySummary
    message: "Peer connectivity: 4/5 reachable (80%), avg latency=12.50ms"
  - severity: Warning
    reason: HighPeerLatency
    message: "High latency to peer node-3: 250.00ms (critical threshold: 200.00ms)"
conditions:
  - type: NetworkPartitioned
    status: "False"
    reason: SufficientPeerReachability
    message: "80% of peers are reachable (4/5)"
  - type: NetworkDegraded
    status: "True"
    reason: HighLatencyDetected
    message: "High latency detected to 1 peers: [node-3 (250.00ms)]"
  - type: CNIHealthy
    status: "True"
    reason: CNIOperational
    message: "CNI plugin configuration and interfaces are healthy"
Network Partition Example:
events:
  - severity: Error
    reason: PeerUnreachable
    message: "Peer node-2 has been unreachable for 3 consecutive checks"
  - severity: Error
    reason: PeerUnreachable
    message: "Peer node-3 has been unreachable for 5 consecutive checks"
conditions:
  - type: NetworkPartitioned
    status: "True"
    reason: InsufficientPeerReachability
    message: "Only 40% of peers are reachable (threshold: 80%). Unreachable: [node-2, node-3]"
RBAC Requirements:
The CNI monitor requires pod list permissions for Kubernetes-based peer discovery:
rules:
  - apiGroups: [""]
    resources: ["pods"]
    verbs: ["get", "list", "watch"]
Architecture Notes:
Since node-doctor runs with hostNetwork: true, the CNI monitor:
- Uses node IPs (not pod IPs) for connectivity testing
- Can detect node-level network issues that affect CNI
- Works even if CNI is completely broken (since it bypasses pod network)
- Tests actual node-to-node connectivity that CNI relies on
Use Cases:
- Network Partition Detection: Detect when a node becomes isolated from the cluster
- CNI Health Monitoring: Validate CNI plugin is properly configured
- Network Latency Alerting: Identify network performance degradation
- Cross-Node Connectivity: Ensure nodes can communicate for pod-to-pod traffic
- Node Health Correlation: A node that can't reach other nodes should be considered unhealthy
Monitors Kubelet health, metrics, and PLEG (Pod Lifecycle Event Generator) performance.
Monitor Type: kubernetes-kubelet-check
Source File: pkg/monitors/kubernetes/kubelet.go:350
Configuration:
monitors:
  - name: kubelet-health
    type: kubernetes-kubelet-check
    interval: 30s
    timeout: 10s
    config:
      healthzURL: http://127.0.0.1:10248/healthz
      metricsURL: http://127.0.0.1:10250/metrics
      checkPLEG: true
      plegThreshold: 5s # PLEG relist duration threshold
      failureThreshold: 3
      auth:
        type: serviceaccount # none, serviceaccount, bearer, certificate
        tokenPath: /var/run/secrets/kubernetes.io/serviceaccount/token
        certPath: /path/to/client.crt
        keyPath: /path/to/client.key
        caPath: /path/to/ca.crt
      circuitBreaker:
        enabled: true
        threshold: 5 # Open after 5 consecutive failures
        timeout: 30s # Half-open after 30s
        maxHalfOpenRequests: 3
Default Values:
- healthzURL: http://127.0.0.1:10248/healthz
- metricsURL: http://127.0.0.1:10250/metrics
- checkPLEG: true
- plegThreshold: 5 seconds
- failureThreshold: 3
- auth.type: none
Authentication Types:
- none: No authentication (healthz endpoint)
- serviceaccount: Use mounted ServiceAccount token
- bearer: Custom bearer token
- certificate: mTLS with client certificate
Circuit Breaker Pattern:
- Closed: Normal operation, requests flow through
- Open: Too many failures, fail fast without requests
- Half-Open: Testing if service recovered, limited requests
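The three states can be sketched as a small state machine driven by the threshold, timeout, and maxHalfOpenRequests settings shown in the configuration. This is a hedged illustration rather than the monitor's implementation; Breaker, Allow, Record, and ErrOpen are hypothetical names, and the sketch assumes any failure during half-open reopens the breaker.

```go
package main

import (
	"errors"
	"fmt"
	"sync"
	"time"
)

// State is the circuit breaker state.
type State int

const (
	Closed State = iota
	Open
	HalfOpen
)

// ErrOpen is returned when the breaker rejects a request without trying it.
var ErrOpen = errors.New("circuit breaker is open")

// Breaker is a minimal consecutive-failure circuit breaker.
type Breaker struct {
	mu           sync.Mutex
	state        State
	failures     int // consecutive failures observed
	halfOpenUsed int // probes handed out in the current half-open phase
	openedAt     time.Time
	threshold    int
	timeout      time.Duration
	maxHalfOpen  int
}

func NewBreaker(threshold int, timeout time.Duration, maxHalfOpen int) *Breaker {
	return &Breaker{threshold: threshold, timeout: timeout, maxHalfOpen: maxHalfOpen}
}

// Allow reports whether a request may proceed right now.
func (b *Breaker) Allow() error {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.state == Open {
		if time.Since(b.openedAt) < b.timeout {
			return ErrOpen // still cooling down: fail fast
		}
		b.state, b.halfOpenUsed = HalfOpen, 0
	}
	if b.state == HalfOpen {
		if b.halfOpenUsed >= b.maxHalfOpen {
			return ErrOpen // probe budget exhausted
		}
		b.halfOpenUsed++
	}
	return nil
}

// Record feeds the outcome of a request back into the breaker.
func (b *Breaker) Record(err error) {
	b.mu.Lock()
	defer b.mu.Unlock()
	if err == nil {
		b.state, b.failures = Closed, 0
		return
	}
	b.failures++
	if b.state == HalfOpen || b.failures >= b.threshold {
		b.state, b.openedAt = Open, time.Now()
	}
}

func main() {
	b := NewBreaker(5, 30*time.Second, 3)
	for i := 0; i < 5; i++ {
		b.Record(errors.New("kubelet metrics request failed"))
	}
	fmt.Println("after 5 failures:", b.Allow())
}
```

Callers would wrap each metrics request in Allow and Record, so an unresponsive kubelet is not hit with a full request on every check interval.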
Key Features:
- Kubelet /healthz endpoint monitoring
- Metrics endpoint authentication support
- PLEG relist duration monitoring (container runtime responsiveness)
- Circuit breaker for fault protection
- Multiple authentication methods
- TLS verification with custom CA
Events Generated:
- KubeletUnhealthy (Error): Kubelet healthz check failed
- KubeletAuthFailure (Error): Authentication to metrics endpoint failed
- PLEGSlow (Warning): PLEG relist duration exceeds threshold
- KubeletRecovered (Info): Kubelet health restored
Conditions:
- KubeletUnhealthy: True when failure threshold exceeded
Example Status:
events:
  - severity: Warning
    reason: PLEGSlow
    message: "PLEG relist duration 7.2s exceeds 5s threshold"
conditions:
  - type: KubeletUnhealthy
    status: "True"
    reason: HealthzFailed
    message: "Kubelet healthz endpoint failed 3 consecutive checks"
Monitors Kubernetes API server connectivity, latency, and authentication.
Monitor Type: kubernetes-apiserver-check
Source File: pkg/monitors/kubernetes/apiserver.go:514
Configuration:
monitors:
  - name: apiserver-health
    type: kubernetes-apiserver-check
    interval: 30s
    timeout: 15s
    config:
      endpoint: https://kubernetes.default.svc.cluster.local
      latencyThreshold: 2s
      checkVersion: true
      checkAuth: true
      failureThreshold: 3
      httpTimeout: 10s
Default Values:
- endpoint: https://kubernetes.default.svc.cluster.local (in-cluster)
- latencyThreshold: 2 seconds
- checkVersion: true
- checkAuth: true
- failureThreshold: 3
- httpTimeout: 10 seconds
Key Features:
- Uses Kubernetes client-go for API interaction
- In-cluster ServiceAccount authentication (automatic)
- GET /version endpoint for lightweight health checking
- Latency measurement and threshold alerting
- Authentication failure detection (401/403)
- Rate limiting detection (429)
- Error sanitization (prevents token leakage in logs)
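The status-code handling named above (401/403 for authentication, 429 for rate limiting) can be illustrated with a small classifier. This is a sketch under assumptions: classifyStatus is a hypothetical helper, the event reasons are the ones this section lists, and mapping every other non-2xx response to APIServerUnreachable is a guess rather than documented behavior.

```go
package main

import (
	"fmt"
	"net/http"
)

// classifyStatus maps an HTTP status code from the API server to an event
// reason and severity. Empty strings mean the response is healthy.
func classifyStatus(code int) (reason, severity string) {
	switch {
	case code == http.StatusUnauthorized || code == http.StatusForbidden:
		return "APIServerAuthFailure", "Error"
	case code == http.StatusTooManyRequests:
		return "APIServerRateLimited", "Warning"
	case code >= 200 && code < 300:
		return "", ""
	default:
		// Assumption: other failures are treated as reachability problems.
		return "APIServerUnreachable", "Error"
	}
}

func main() {
	for _, code := range []int{200, 401, 403, 429, 503} {
		reason, severity := classifyStatus(code)
		fmt.Printf("%d -> reason=%q severity=%q\n", code, reason, severity)
	}
}
```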
Events Generated:
- APIServerUnreachable (Error): Cannot reach API server
- APIServerSlow (Warning): API latency exceeds threshold
- APIServerAuthFailure (Error): Authentication failed
- APIServerRateLimited (Warning): Rate limit detected
- APIServerRecovered (Info): API server reachable again
Conditions:
- APIServerUnreachable: True when failure threshold exceeded
Security:
- Error message sanitization prevents sensitive data leakage
- No authentication tokens in logs or events
- Generic error messages for auth/TLS failures
Example Status:
events:
  - severity: Error
    reason: APIServerUnreachable
    message: "API server unreachable after 3 consecutive failures: connection refused"
  - severity: Warning
    reason: APIServerSlow
    message: "API server latency 3.5s exceeds threshold 2.0s"
Monitors container runtime health (Docker, containerd, CRI-O).
Monitor Type: kubernetes-runtime-check
Source File: pkg/monitors/kubernetes/runtime.go:618
Configuration:
monitors:
  - name: runtime-health
    type: kubernetes-runtime-check
    interval: 30s
    timeout: 10s
    config:
      runtimeType: auto # auto, docker, containerd, crio
      dockerSocket: /var/run/docker.sock
      containerdSocket: /run/containerd/containerd.sock
      crioSocket: /var/run/crio/crio.sock
      checkSocketConnectivity: true
      checkSystemdStatus: true
      checkRuntimeInfo: true
      failureThreshold: 3
      timeout: 5s
Default Values:
- runtimeType: auto (auto-detect)
- Socket paths: Standard locations for each runtime
- checkSocketConnectivity: true
- checkSystemdStatus: true
- checkRuntimeInfo: true
- failureThreshold: 3
- timeout: 5 seconds
Runtime Auto-Detection:
- Check for Docker socket at /var/run/docker.sock
- Check for containerd socket at /run/containerd/containerd.sock
- Check for CRI-O socket at /var/run/crio/crio.sock
- Use first detected runtime
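The detection order above amounts to a first-match scan over the known socket paths. A minimal sketch follows, assuming hypothetical names (runtimeSocket, detectRuntime, socketExists); the existence check is injected so the ordering logic can be exercised without real sockets.

```go
package main

import (
	"fmt"
	"os"
)

// runtimeSocket pairs a runtime name with its default socket path.
type runtimeSocket struct{ name, path string }

// defaultSockets is ordered by detection priority, as listed above.
var defaultSockets = []runtimeSocket{
	{"docker", "/var/run/docker.sock"},
	{"containerd", "/run/containerd/containerd.sock"},
	{"crio", "/var/run/crio/crio.sock"},
}

// socketExists reports whether path exists and is a Unix socket.
func socketExists(path string) bool {
	info, err := os.Stat(path)
	return err == nil && info.Mode()&os.ModeSocket != 0
}

// detectRuntime returns the first runtime whose socket passes the exists check.
func detectRuntime(sockets []runtimeSocket, exists func(string) bool) (string, bool) {
	for _, s := range sockets {
		if exists(s.path) {
			return s.name, true
		}
	}
	return "", false
}

func main() {
	if name, ok := detectRuntime(defaultSockets, socketExists); ok {
		fmt.Println("detected runtime:", name)
	} else {
		fmt.Println("no container runtime socket found")
	}
}
```

Because the scan stops at the first match, a node with both Docker and containerd sockets present is reported as docker; setting runtimeType explicitly avoids that ambiguity.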
Check Types:
- Socket Connectivity: Unix socket connection test
- Systemd Status: systemctl is-active <service> check
- Runtime Info: Basic API connectivity verification
Key Features:
- Multi-runtime support with auto-detection
- Socket accessibility testing
- Systemd service health checking
- Systemd state awareness (active, inactive, failed, activating, etc.)
- Custom socket path override
- Graceful handling of missing runtimes
Events Generated:
- RuntimeSocketUnreachable (Warning): Cannot connect to runtime socket
- RuntimeSystemdInactive (Warning): Systemd service not active
- RuntimeInfoFailed (Warning): Failed to retrieve runtime info
- RuntimeHealthy (Info): All runtime checks passed
- RuntimeRecovered (Info): Runtime health restored
Conditions:
- ContainerRuntimeUnhealthy: True when failure threshold exceeded
Example Status:
events:
  - severity: Warning
    reason: RuntimeSystemdInactive
    message: "Container runtime systemd service (docker) is not active"
conditions:
  - type: ContainerRuntimeUnhealthy
    status: "True"
    reason: HealthCheckFailed
    message: "Container runtime (docker) has failed health checks for 3 consecutive attempts"
Monitors pod capacity on Kubernetes nodes and alerts when approaching limits.
Monitor Type: kubernetes-capacity-check
Source File: pkg/monitors/kubernetes/capacity.go:512
Configuration:
monitors:
  - name: capacity-health
    type: kubernetes-capacity-check
    interval: 60s
    timeout: 15s
    config:
      nodeName: "" # Auto-detected from NODE_NAME env var
      warningThreshold: 90 # 90% capacity
      criticalThreshold: 95 # 95% capacity
      failureThreshold: 3
      apiTimeout: 10s
      checkAllocatable: true # Use allocatable vs capacity
Default Values:
- nodeName: Auto-detected from NODE_NAME environment variable
- warningThreshold: 90%
- criticalThreshold: 95%
- failureThreshold: 3
- apiTimeout: 10 seconds
- checkAllocatable: true
Note: Threshold values (warningThreshold, criticalThreshold) are automatically validated to be in the range 0-100. See Configuration Guide for details.
Allocatable vs Capacity:
- Allocatable: Pods actually schedulable (capacity minus system reservations)
- Capacity: Total pod slots on the node
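Whichever denominator is chosen (allocatable or capacity), the evaluation is a percentage compared against the two thresholds. The sketch below uses hypothetical names (podUtilization, classify) and assumes both thresholds are inclusive, consistent with the event descriptions in this section.

```go
package main

import "fmt"

type level string

const (
	levelNormal   level = "normal"
	levelWarning  level = "warning"
	levelCritical level = "critical"
)

// podUtilization returns running pods as a percentage of available pod slots.
// slots is the allocatable or the capacity pod count, depending on checkAllocatable.
func podUtilization(running, slots int64) float64 {
	if slots <= 0 {
		return 0
	}
	return float64(running) / float64(slots) * 100
}

// classify compares a utilization percentage against both thresholds.
func classify(pct, warning, critical float64) level {
	switch {
	case pct >= critical:
		return levelCritical
	case pct >= warning:
		return levelWarning
	default:
		return levelNormal
	}
}

func main() {
	pct := podUtilization(110, 114)
	fmt.Printf("pod capacity at %.1f%%: %s\n", pct, classify(pct, 90, 95))
}
```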
Key Features:
- Kubernetes API integration for pod counting
- Only counts pods in Running phase
- Node name auto-detection from environment
- Separate warning and critical thresholds
- State transition tracking (normal → warning → critical)
- Recovery event generation
- In-cluster authentication via ServiceAccount
Events Generated:
- PodCapacityWarning (Warning): Capacity 90-94%
- PodCapacityPressure (Error): Capacity ≥95%
- PodCapacityRecovered (Info): Capacity returned to normal
- CapacityCheckFailed (Error): Failed to query capacity
Conditions:
- PodCapacityPressure: True when critical threshold exceeded
- PodCapacityUnhealthy: True when repeated check failures
Example Status:
events:
  - severity: Error
    reason: PodCapacityPressure
    message: "Node pod capacity at 96.5% (110/114 pods)"
conditions:
  - type: PodCapacityPressure
    status: "True"
    reason: HighPodUtilization
    message: "Pod capacity at 96.5% (110/114), exceeds 95% threshold"
Executes custom external plugins for health checking with JSON or simple output formats.
Monitor Type: custom-plugin-check
Source File: pkg/monitors/custom/plugin.go:300
Configuration:
monitors:
  - name: custom-gpu-health
    type: custom-plugin-check
    interval: 60s
    timeout: 30s
    config:
      pluginPath: /usr/local/bin/check-gpu-health
      args:
        - --verbose
        - --threshold=80
      outputFormat: json # json or simple
      failureThreshold: 3
      apiTimeout: 10s
      env:
        GPU_DEVICE: "0"
        LOG_LEVEL: "info"
Default Values:
- outputFormat: json
- failureThreshold: 3
- apiTimeout: 10 seconds
Output Formats:
- JSON Output:
{
  "status": "healthy",
  "message": "GPU utilization at 45%",
  "events": [
    {
      "severity": "info",
      "reason": "GPUNormal",
      "message": "GPU temperature: 65C"
    }
  ]
}
Valid status values: healthy, warning, critical, unknown
- Simple Output:
OK: GPU utilization at 45%
Exit codes:
- 0: Healthy
- 1: Warning
- 2: Critical
- Other: Unknown/Error
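Both formats can be reduced to the same four status values. The following sketch shows one way to decode the JSON shape above and to map exit codes for the simple format; pluginResult, parseJSONOutput, and statusFromExitCode are hypothetical names, not the monitor's API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// pluginEvent and pluginResult mirror the JSON output shape shown above.
type pluginEvent struct {
	Severity string `json:"severity"`
	Reason   string `json:"reason"`
	Message  string `json:"message"`
}

type pluginResult struct {
	Status  string        `json:"status"`
	Message string        `json:"message"`
	Events  []pluginEvent `json:"events"`
}

// statusFromExitCode maps simple-format exit codes to a status value.
func statusFromExitCode(code int) string {
	switch code {
	case 0:
		return "healthy"
	case 1:
		return "warning"
	case 2:
		return "critical"
	default:
		return "unknown"
	}
}

// parseJSONOutput decodes plugin stdout and rejects unrecognized status values.
func parseJSONOutput(out []byte) (pluginResult, error) {
	var r pluginResult
	if err := json.Unmarshal(out, &r); err != nil {
		return r, fmt.Errorf("invalid plugin JSON: %w", err)
	}
	switch r.Status {
	case "healthy", "warning", "critical", "unknown":
		return r, nil
	}
	return r, fmt.Errorf("invalid plugin status %q", r.Status)
}

func main() {
	out := []byte(`{"status":"healthy","message":"GPU utilization at 45%","events":[{"severity":"info","reason":"GPUNormal","message":"GPU temperature: 65C"}]}`)
	r, err := parseJSONOutput(out)
	fmt.Println(r.Status, len(r.Events), err)
	fmt.Println(statusFromExitCode(2))
}
```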
Key Features:
- External plugin execution with timeout
- JSON and simple (Nagios-style) output parsing
- Custom environment variable support
- Command-line argument passing
- Failure threshold tracking
- Plugin state validation
- Error handling and recovery
Events Generated:
- Plugin-defined events (from JSON output)
- PluginCheckFailed (Error): Plugin execution failed
- PluginUnknownState (Warning): Plugin returned unknown status
Conditions:
- PluginUnhealthy: True when plugin reports critical or repeated failures
Example Plugin (Bash):
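#!/bin/bash
# GPU Health Check Plugin
# Query a single GPU; multi-GPU hosts would otherwise print one value per line.
gpu_temp=$(nvidia-smi --id="${GPU_DEVICE:-0}" --query-gpu=temperature.gpu --format=csv,noheader,nounits)
if ! [[ "$gpu_temp" =~ ^[0-9]+$ ]]; then
  # nvidia-smi failed or returned something that is not a number
  status="unknown"
  severity="warning"
  detail="GPU temperature unavailable"
elif [ "$gpu_temp" -lt 80 ]; then
  status="healthy"
  severity="info"
elif [ "$gpu_temp" -lt 90 ]; then
  status="warning"
  severity="warning"
else
  status="critical"
  severity="error"
fi
detail="${detail:-GPU temperature: ${gpu_temp}C}"
cat <<EOF
{
  "status": "$status",
  "message": "$detail",
  "events": [
    {
      "severity": "$severity",
      "reason": "GPUTemperature",
      "message": "$detail"
    }
  ]
}
EOF
Monitors system logs for specific patterns with ReDoS protection and deduplication.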
Monitor Type: custom-logpattern-check
Source File: pkg/monitors/custom/logpattern.go:350
Configuration:
monitors:
  - name: log-pattern-health
    type: custom-logpattern-check
    interval: 30s
    timeout: 10s
    config:
      useDefaults: true # Include default critical patterns
      patterns:
        - pattern: 'kernel: Out of memory'
          severity: error
          reason: OOMDetected
          message: "Out of memory condition detected in kernel logs"
        - pattern: 'segmentation fault'
          severity: warning
          reason: SegFault
          message: "Segmentation fault detected"
        - pattern: 'DENIED'
          severity: info
          reason: SELinuxDenial
          message: "SELinux denial detected"
      # Kernel log monitoring (choose one):
      checkKernelJournal: true # PRIMARY: Use journalctl -k (recommended)
      checkKmsg: false # FALLBACK: Direct /dev/kmsg access
      kmsgPath: /dev/kmsg # Path for checkKmsg fallback
      # Service unit log monitoring:
      checkJournal: true # Enable systemd journal monitoring
      journalUnits:
        - kubelet
        - docker
        - containerd
      maxEventsPerPattern: 10 # Max events per pattern per check (1-1000)
      dedupWindow: 5m # Deduplication window (1s-1h)
Default Values:
- useDefaults: true
- checkKernelJournal: true (PRIMARY - uses journalctl -k)
- checkKmsg: false (FALLBACK - direct /dev/kmsg access)
- kmsgPath: /dev/kmsg
- checkJournal: true (service unit logs)
- maxEventsPerPattern: 10 (range: 1-1000)
- dedupWindow: 5 minutes (range: 1s-1h)
Kernel Log Monitoring Methods:
| Method | Config Option | Command | Use Case |
|---|---|---|---|
| Journal (Primary) | checkKernelJournal: true | journalctl -k --since | Recommended - uses systemd journal, supports time-based filtering |
| Kmsg (Fallback) | checkKmsg: true | Read /dev/kmsg | Non-systemd systems or when journalctl unavailable |
Note: The container image includes the journalctl binary from the systemd package to support kernel journal monitoring. If both checkKernelJournal and checkKmsg are enabled, kernel journal takes precedence.
Default Patterns (when useDefaults=true):
System & Hardware:
- OOM kills: killed process|Out of memory|oom-kill
- Kernel panics: Kernel panic|BUG: unable to handle
- Hardware errors: Machine check events|Hardware Error
- Filesystem errors: EXT4-fs error|XFS.*error|I/O error
- Storage soft lockup: soft lockup.*(?:longhorn|mpt|scsi|iscsi|nvme|nfs) (error)
VMware:
- vmxnet3 TX hang: vmxnet3.*tx hang (error)
- vmxnet3 NIC reset: vmxnet3.*resetting (warning)
- NSX errors: nsx.*(?:error|failed|timeout) (error)
Networking - Conntrack/Netfilter:
- Conntrack table full: nf_conntrack.*table full (error)
- Conntrack dropping: nf_conntrack.*dropping packet (error)
- Netfilter error: (?:nf_tables|nftables|netfilter).*error (error)
- iptables error: iptables.*(?:error|failed|invalid argument) (error)
- iptables sync failed: (?:iptables|kube-proxy).*sync.*failed (error)
Networking - NIC/Driver:
- NIC link down: (?:e1000e|igb|ixgbe|mlx|bnxt|r8169).*Link is Down (warning)
- NIC driver error: (?:e1000|igb|ixgbe|mlx[45]|bnxt|i40e).*error (error)
- NIC TX timeout: NETDEV WATCHDOG.*transmit.*timed out (error)
- NIC firmware error: (?:firmware|nvram).*failed|Unable to load firmware (error)
- Carrier lost: carrier (?:lost|off) (warning)
Networking - Network Stack:
- Socket buffer overrun: (?:socket buffer.*overrun|packets.*pruned.*socket|RcvbufErrors) (warning)
- TCP retransmit error: TCP.*retransmit.*timeout|tcp_retries.*exceeded (warning)
- ARP resolution failed: (?:ARP.*failed|no ARP.*reply|neighbor.*FAILED) (warning)
- Route error: (?:RTNETLINK.*error|route.*failed|no route to host) (error)
Networking - CNI:
- Calico error: (?:calico|felix).*error (error)
- Flannel error: flannel.*(?:error|panic|failed) (error)
- Cilium error: cilium-agent.*(?:error|panic)|BPF.*load.*failed (error)
- CNI plugin failed: CNI plugin.*(?:error|failed|timeout) (error)
Networking - kube-proxy/IPVS:
- IPVS sync error: ipvs.*(?:error|failed|sync failed) (error)
- kube-proxy error: kube-proxy.*(?:error|failed) (error)
- Endpoint sync failed: (?:endpoint.*syncing|UpdateEndpoints).*failed (warning)
Networking - Pod Networking:
- veth error: veth.*(?:error|failed|cannot create) (error)
- Network namespace error: (?:netns|network namespace).*(?:error|failed) (error)
- Pod network setup failed: failed to (?:set up|setup).*network (error)
Cloud Provider:
- AWS ENI error: (?:eni|ENI|vpc-cni).*(?:error|failed) (error)
- Azure network error: azure.*(?:cni|network).*(?:error|failed) (error)
Resource Limits:
- Maximum 60 patterns
- Maximum 20 journal units
- Pattern complexity scoring to prevent ReDoS
- Deduplication window: 1 second to 1 hour
ReDoS Protection:
The monitor validates regex patterns for safety:
- Complexity Scoring: Assigns penalty points for risky patterns
  - Nested quantifiers: (a+)+ (high risk)
  - Overlapping alternations: (a|ab)* (moderate risk)
  - Greedy quantifiers: .* (low risk)
- Threshold Rejection: Patterns exceeding complexity score rejected
- Timeout Enforcement: Context-based timeout for regex matching
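One way to detect the highest-risk case, nested quantifiers, is to walk the parsed pattern with Go's regexp/syntax package and measure how deeply quantifiers nest. This is a simplified stand-in for the scoring described above, not the monitor's actual algorithm; quantifierDepth, checkPattern, and the depth limit are hypothetical. Go's own regexp engine matches in time linear in the input, so a check like this mainly serves as a policy guard on submitted patterns.

```go
package main

import (
	"fmt"
	"regexp/syntax"
)

// isQuantifier reports whether op repeats its subexpression.
func isQuantifier(op syntax.Op) bool {
	switch op {
	case syntax.OpStar, syntax.OpPlus, syntax.OpQuest, syntax.OpRepeat:
		return true
	}
	return false
}

// quantifierDepth returns the deepest nesting of quantifiers in the tree:
// 0 for a literal, 1 for ".*", 2 for "(a+)+".
func quantifierDepth(re *syntax.Regexp) int {
	deepest := 0
	for _, sub := range re.Sub {
		if d := quantifierDepth(sub); d > deepest {
			deepest = d
		}
	}
	if isQuantifier(re.Op) {
		deepest++
	}
	return deepest
}

// checkPattern parses pattern and rejects it when quantifiers nest deeper
// than maxDepth. It returns the measured depth either way.
func checkPattern(pattern string, maxDepth int) (int, error) {
	re, err := syntax.Parse(pattern, syntax.Perl)
	if err != nil {
		return 0, fmt.Errorf("invalid pattern: %w", err)
	}
	depth := quantifierDepth(re)
	if depth > maxDepth {
		return depth, fmt.Errorf("pattern %q nests quantifiers %d deep (limit %d)", pattern, depth, maxDepth)
	}
	return depth, nil
}

func main() {
	for _, p := range []string{"kernel: Out of memory", "vmxnet3.*tx hang", "(a+)+"} {
		depth, err := checkPattern(p, 1)
		fmt.Printf("%-24q depth=%d err=%v\n", p, depth, err)
	}
}
```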
Key Features:
- Kernel journal monitoring via journalctl -k (primary, recommended)
- Kernel message monitoring via /dev/kmsg (fallback)
- Systemd service unit journal monitoring (kubelet, containerd, docker)
- Regex pattern matching with safety validation
- Deduplication to prevent event flooding
- Custom pattern support
- Default critical pattern library
- Event rate limiting per pattern
- Time-based filtering (only processes new logs since last check)
- ARM64 compatibility
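Deduplication within dedupWindow can be sketched as a map from event key to the time it was last emitted. The names below (deduper, shouldEmit, prune) are hypothetical, and keying by reason is an assumption; the real monitor may key on pattern or message instead.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// deduper suppresses repeats of the same key inside a time window.
type deduper struct {
	mu     sync.Mutex
	window time.Duration
	last   map[string]time.Time // key -> time of last emitted event
}

func newDeduper(window time.Duration) *deduper {
	return &deduper{window: window, last: make(map[string]time.Time)}
}

// shouldEmit returns true when key was not emitted within the window ending
// at now, and records now as the latest emission in that case.
func (d *deduper) shouldEmit(key string, now time.Time) bool {
	d.mu.Lock()
	defer d.mu.Unlock()
	if t, ok := d.last[key]; ok && now.Sub(t) < d.window {
		return false
	}
	d.last[key] = now
	return true
}

// prune drops expired entries so the map does not grow without bound.
func (d *deduper) prune(now time.Time) {
	d.mu.Lock()
	defer d.mu.Unlock()
	for key, t := range d.last {
		if now.Sub(t) >= d.window {
			delete(d.last, key)
		}
	}
}

func main() {
	d := newDeduper(5 * time.Minute)
	start := time.Now()
	fmt.Println(d.shouldEmit("OOMDetected", start))                    // first occurrence
	fmt.Println(d.shouldEmit("OOMDetected", start.Add(time.Minute)))   // inside the window
	fmt.Println(d.shouldEmit("OOMDetected", start.Add(6*time.Minute))) // window expired
	d.prune(start.Add(time.Hour))
}
```

Memory use grows with the number of distinct keys retained, which is why a shorter window reduces the footprint.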
Events Generated:
- Pattern-defined events (custom severity/reason/message)
- LogPatternCheckFailed (Error): Failed to read logs
Example Status:
events:
  - severity: Error
    reason: OOMDetected
    message: "Out of memory condition detected in kernel logs"
  - severity: Warning
    reason: SegFault
    message: "Segmentation fault detected (2 occurrences in last 5m)"
VMware vmxnet3 TX Hang Detection:
The vmxnet3 patterns detect VMware virtual NIC transmit hangs that can cause cascade failures in storage systems, particularly in Longhorn environments. When a vmxnet3 TX hang occurs:
- TX Hang (vmxnet3-tx-hang): The virtual NIC's transmit queue stalls
- NIC Reset (vmxnet3-nic-reset): VMware attempts automatic recovery
- Storage Impact: iSCSI/Longhorn connections may time out during the outage
- Soft Lockup (soft-lockup-storage): CPU may appear stuck in storage I/O
Cascade Timeline Example:
02:02:36 vmxnet3 0000:03:00.0 ens160: tx hang # TX hang starts
02:02:36 vmxnet3 0000:03:00.0 ens160: resetting # NIC reset begins
02:02:39 vmxnet3 0000:03:00.0: intr vectors allocated # Recovery complete
02:07:03 soft lockup - CPU#2 stuck [longhorn-instan] # Storage impact
Recommended Response:
- Check ESXi host logs for underlying cause (vMotion, resource contention)
- Review Longhorn/storage replica health
- Consider increasing Longhorn timeouts in high-vmxnet3-hang environments
- This is a VMware virtualization layer issue, not a guest OS or Kubernetes problem
Symptom: Monitor appears in config but doesn't generate status updates
Possible Causes:
- Invalid configuration (check validation errors in logs)
- Timeout exceeds interval
- Missing dependencies (e.g., Kubernetes client for API server monitor)
Solution:
# Check logs for validation errors
journalctl -u node-doctor -f | grep -i error
# Verify configuration
node-doctor validate --config /etc/node-doctor/config.yaml
# Ensure timeout < interval
# timeout: 10s
# interval: 30s # Good: interval > timeout
Symptom: Monitor fails with "permission denied" errors
Common Locations:
- /dev/kmsg: Requires CAP_SYSLOG or root
- /proc/diskstats: Requires read access
- Container runtime sockets: Requires socket access
Solution:
# DaemonSet securityContext
securityContext:
  privileged: true # Or specific capabilities
  capabilities:
    add:
      - SYSLOG # Needed to read /dev/kmsg without running as privileged
      - SYS_ADMIN
      - SYS_RESOURCE
Symptom: Node Doctor consuming excessive memory
Possible Causes:
- Too many log pattern monitors
- Large deduplication windows
- Too many custom patterns
Solution:
# Reduce deduplication window
dedupWindow: 1m # Instead of 1h
# Limit patterns
maxEventsPerPattern: 5 # Instead of 100
# Reduce check frequency
interval: 60s # Instead of 10s
Symptom: Expected events not appearing
Troubleshooting Steps:
- Check Monitor Status:
# View monitor list
kubectl exec -it <pod> -- node-doctor monitors list
# Check specific monitor
kubectl exec -it <pod> -- node-doctor monitors status <monitor-name>
- Verify Thresholds:
# Lower thresholds for testing
warningThreshold: 50 # Instead of 85
criticalThreshold: 75 # Instead of 95
- Check Failure Threshold:
# Reduce to see events sooner
failureThreshold: 1 # Instead of 3
- Enable Debug Logging:
# In config.yaml
logging:
  level: debug
Symptom: TLS handshake failures, certificate verification errors
Kubelet Monitor:
config:
  auth:
    type: certificate
    certPath: /path/to/client.crt
    keyPath: /path/to/client.key
    caPath: /path/to/ca.crt # Ensure CA matches server cert
API Server Monitor:
# Verify ServiceAccount token is mounted
volumeMounts:
  - name: serviceaccount
    mountPath: /var/run/secrets/kubernetes.io/serviceaccount
    readOnly: true
Symptom: /dev/kmsg EINVAL errors on ARM64 nodes
Solution: This is expected behavior. The Memory Monitor handles this gracefully:
# OOM kill detection automatically disabled on ARM64 if /dev/kmsg fails
checkOOMKills: true # Will gracefully skip on ARM64
Create a new package under pkg/monitors/:
package custom
import (
	"context"
	"fmt"
	"github.com/supporttools/node-doctor/pkg/monitors"
	"github.com/supporttools/node-doctor/pkg/types"
)
type CustomMonitorConfig struct {
	// Your configuration fields
	Threshold float64 `json:"threshold"`
}
type CustomMonitor struct {
	*monitors.BaseMonitor
	config *CustomMonitorConfig
}
func NewCustomMonitor(ctx context.Context, config types.MonitorConfig) (types.Monitor, error) {
	// Validate configuration
	if err := ValidateCustomConfig(config); err != nil {
		return nil, fmt.Errorf("invalid configuration: %w", err)
	}
	// Parse custom configuration
	customConfig, err := parseCustomConfig(config.Config)
	if err != nil {
		return nil, fmt.Errorf("failed to parse config: %w", err)
	}
	// Create base monitor
	baseMonitor, err := monitors.NewBaseMonitor(config.Name, config.Interval, config.Timeout)
	if err != nil {
		return nil, fmt.Errorf("failed to create base monitor: %w", err)
	}
	// Create custom monitor
	monitor := &CustomMonitor{
		BaseMonitor: baseMonitor,
		config:      customConfig,
	}
	// Set check function
	if err := baseMonitor.SetCheckFunc(monitor.check); err != nil {
		return nil, fmt.Errorf("failed to set check function: %w", err)
	}
	return monitor, nil
}
func (m *CustomMonitor) check(ctx context.Context) (*types.Status, error) {
	status := types.NewStatus(m.GetName())
	// Perform your health check logic here.
	// doHealthCheck is a placeholder for your own measurement method.
	value := m.doHealthCheck(ctx)
	// Evaluate against threshold
	if value > m.config.Threshold {
		status.AddEvent(types.NewEvent(
			types.EventWarning,
			"ThresholdExceeded",
			fmt.Sprintf("Value %.2f exceeds threshold %.2f", value, m.config.Threshold),
		))
		status.AddCondition(types.NewCondition(
			"Unhealthy",
			types.ConditionTrue,
			"ThresholdExceeded",
			"Custom metric exceeded threshold",
		))
	}
	return status, nil
}
func init() {
	monitors.Register(monitors.MonitorInfo{
		Type:        "custom-mycheck",
		Factory:     NewCustomMonitor,
		Validator:   ValidateCustomConfig,
		Description: "Monitors custom metric",
	})
}
func parseCustomConfig(configMap map[string]interface{}) (*CustomMonitorConfig, error) {
	config := &CustomMonitorConfig{}
	if val, ok := configMap["threshold"]; ok {
		switch v := val.(type) {
		case float64:
			config.Threshold = v
		case int:
			config.Threshold = float64(v)
		default:
			return nil, fmt.Errorf("threshold must be a number")
		}
	} else {
		config.Threshold = 80.0 // Default
	}
	return config, nil
}
func ValidateCustomConfig(config types.MonitorConfig) error {
	if config.Name == "" {
		return fmt.Errorf("monitor name is required")
	}
	if config.Type != "custom-mycheck" {
		return fmt.Errorf("invalid monitor type: %s", config.Type)
	}
	// Validate custom fields
	customConfig, err := parseCustomConfig(config.Config)
	if err != nil {
		return err
	}
	if customConfig.Threshold < 0 || customConfig.Threshold > 100 {
		return fmt.Errorf("threshold must be between 0 and 100")
	}
	return nil
}
- Thread Safety: Use mutexes for shared state
- Context Awareness: Respect context cancellation
- Error Handling: Return meaningful errors, don't panic
- Logging: Use structured logging for debugging
- Testing: Write unit tests with mock dependencies
- Documentation: Document configuration fields and defaults
- Validation: Validate configuration early (fail fast)
- Resource Cleanup: Clean up resources in stop logic
- Failure Tracking: Implement consecutive failure thresholds
- Recovery Events: Report when issues resolve
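The last two practices, consecutive failure thresholds and recovery events, fit in one small mutex-protected tracker. This is a minimal sketch with hypothetical names (failureTracker, observe, transition), not a type provided by the monitors package.

```go
package main

import (
	"fmt"
	"sync"
)

// transition describes what changed after recording a check result.
type transition int

const (
	noChange transition = iota
	becameUnhealthy
	recovered
)

// failureTracker counts consecutive failures and remembers whether the
// unhealthy state has already been reported.
type failureTracker struct {
	mu          sync.Mutex
	threshold   int
	consecutive int
	unhealthy   bool
}

// observe records one check result. It returns becameUnhealthy exactly once
// when the threshold is reached, and recovered on the first success after that.
func (f *failureTracker) observe(ok bool) transition {
	f.mu.Lock()
	defer f.mu.Unlock()
	if ok {
		f.consecutive = 0
		if f.unhealthy {
			f.unhealthy = false
			return recovered
		}
		return noChange
	}
	f.consecutive++
	if !f.unhealthy && f.consecutive >= f.threshold {
		f.unhealthy = true
		return becameUnhealthy
	}
	return noChange
}

func main() {
	tracker := &failureTracker{threshold: 3}
	for _, ok := range []bool{false, false, false, false, true} {
		switch tracker.observe(ok) {
		case becameUnhealthy:
			fmt.Println("set Unhealthy condition and emit an Error event")
		case recovered:
			fmt.Println("clear condition and emit a Recovered event")
		default:
			fmt.Println("no state change")
		}
	}
}
```

Returning a transition instead of the raw state lets the check function emit each event once rather than on every interval.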
package custom
import (
	"context"
	"testing"
	"time"
	"github.com/supporttools/node-doctor/pkg/types"
)
func TestCustomMonitor(t *testing.T) {
	config := types.MonitorConfig{
		Name:     "test-custom",
		Type:     "custom-mycheck",
		Interval: 30 * time.Second,
		Timeout:  5 * time.Second,
		Config: map[string]interface{}{
			"threshold": 75.0,
		},
	}
	monitor, err := NewCustomMonitor(context.Background(), config)
	if err != nil {
		t.Fatalf("Failed to create monitor: %v", err)
	}
	// Start monitor
	statusChan := monitor.Start()
	// Wait for first status
	select {
	case status := <-statusChan:
		// Verify status
		if status.Source != "test-custom" {
			t.Errorf("Expected source 'test-custom', got '%s'", status.Source)
		}
	case <-time.After(10 * time.Second):
		t.Fatal("Timeout waiting for status")
	}
	// Stop monitor
	monitor.Stop()
}
Node Doctor provides a comprehensive monitoring framework with:
- 12 Built-in Monitors covering system, network, Kubernetes, and custom checks
- Pluggable Architecture for easy extension
- Thread-Safe Design for production reliability
- Failure Threshold Tracking to prevent false positives
- Recovery Detection for automatic issue resolution reporting
- Flexible Configuration via YAML with sensible defaults
For additional support, see: