[BUG] cortex sandbox Does Not Report OOM Kills Distinctly From Other SIGKILL Causes #2172

@anthonlindblad

Project

cortex

Description

When a process running inside cortex sandbox is killed by the Linux OOM (out-of-memory) killer, cortex reports a generic exit code and error message that do not identify OOM as the cause. This makes memory issues hard to debug, because the user gets no indication that memory exhaustion occurred.

Error Message

$ cortex sandbox linux -- python3 -c "x = [0] * (10**9)"
Sandbox process exited with code 137

# Exit code 137 = 128 + 9 (SIGKILL), but there is no indication of WHY the process was killed
# The user has to guess: was it OOM? A timeout? A manual kill?
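
The arithmetic in the comment above can be checked mechanically. A minimal Python sketch, assuming the usual shell convention of reporting death by signal as 128 plus the signal number; `describe_exit_code` is a hypothetical helper and not part of cortex:

```python
import signal


def describe_exit_code(code: int) -> str:
    """Map a shell-style exit code to a signal name when it encodes one."""
    if code > 128:
        try:
            # 128 + N conventionally means "terminated by signal N".
            return signal.Signals(code - 128).name
        except ValueError:
            pass  # not a known signal number on this platform
    return f"exit {code}"


print(describe_exit_code(137))  # 137 - 128 = 9, which is SIGKILL on Linux
```

The signal name is as far as the exit code goes: it carries no information about who sent the SIGKILL, which is the gap this issue describes.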

Debug Logs

$ RUST_LOG=debug cortex sandbox linux -- python3 -c "x = [0] * (10**9)"
[DEBUG] Starting sandbox with cgroup memory limit: 512MB
[DEBUG] Spawning process: python3 -c "x = [0] * (10**9)"
[DEBUG] Process started with PID: 12345
[DEBUG] Process received signal: SIGKILL
[DEBUG] Sandbox terminated with exit code: 137
# No log about memory.current, memory.max, or oom_kill events!

System Information

Bounty Version: 0.1.0
OS: Ubuntu 24.04 LTS
CPU: AMD EPYC-Genoa Processor (8 cores)
RAM: 15 GB
Cgroup: v2 (unified hierarchy)

Screenshots

No response

Steps to Reproduce

  1. Start a sandbox with default memory limits:

    cortex sandbox linux -- python3 -c "x = [0] * (10**9)"
  2. Observe the exit code:

    Sandbox process exited with code 137
    
  3. Check the kernel log for OOM events (manual detective work):

    sudo dmesg | grep -i oom
    # Shows: python3 invoked oom-killer: gfp_mask=0x...
  4. Note that cortex provided no indication that OOM was the cause

Expected Behavior

Cortex should detect and report OOM kills explicitly:

  1. Check cgroup OOM events: Read memory.events after the process exits and check whether the oom_kill counter increased
  2. Report OOM clearly:
    Sandbox process killed by OOM (used 512MB of 512MB limit)
    Exit code: 137 (OOM killed)
    
  3. Include memory stats: Show peak memory usage vs limit
  4. Suggest remediation: "Increase sandbox memory limit with --memory 1G"
  5. Structured output: In --json mode, include OOM information:
    {
      "exit_code": 137,
      "signal": "SIGKILL",
      "oom_killed": true,
      "memory_limit": "512MB",
      "memory_peak": "512MB"
    }
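
Items 2 to 5 could fit together roughly as follows. This is only a sketch in Python: `build_oom_report`, `human_message`, and `format_mb` are hypothetical names, the field names are copied from the JSON proposed above, and `--memory` is the flag suggested in item 4, not a confirmed cortex option.

```python
import json


def format_mb(num_bytes: int) -> str:
    """Render a byte count in whole MiB, labelled 'MB' as in this issue."""
    return f"{num_bytes // (1024 * 1024)}MB"


def build_oom_report(exit_code: int, oom_killed: bool,
                     limit_bytes: int, peak_bytes: int) -> dict:
    """Assemble the fields of the proposed --json output."""
    return {
        "exit_code": exit_code,
        "signal": "SIGKILL" if exit_code == 137 else None,
        "oom_killed": oom_killed,
        "memory_limit": format_mb(limit_bytes),
        "memory_peak": format_mb(peak_bytes),
    }


def human_message(report: dict) -> str:
    """Produce the plain-text form of the same report."""
    if report["oom_killed"]:
        return (
            f"Sandbox process killed by OOM (used {report['memory_peak']} "
            f"of {report['memory_limit']} limit)\n"
            f"Exit code: {report['exit_code']} (OOM killed)\n"
            "Hint: increase the sandbox memory limit, e.g. --memory 1G"
        )
    return f"Sandbox process exited with code {report['exit_code']}"


report = build_oom_report(137, True, 536870912, 536870912)
print(json.dumps(report, indent=2))
print(human_message(report))
```

Building the text message from the same dictionary that backs the JSON output keeps the two modes from drifting apart.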

Actual Behavior

When OOM kills the sandboxed process:

  1. Cortex only reports the exit code (137)
  2. No distinction between OOM kill, user-initiated kill, or other SIGKILL sources
  3. No memory usage statistics provided
  4. User must manually check dmesg or /var/log/kern.log to diagnose
  5. --json output doesn't include any memory-related information

This is particularly problematic because:

  • CI/CD systems need to distinguish OOM from other failures for automatic retry/resource-adjustment logic
  • Users may waste time debugging code bugs when the issue is just insufficient memory allocation
  • Memory leaks are hard to detect without seeing how close to the limit the process got
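
To illustrate the first point, a CI wrapper could branch on an `oom_killed` field if cortex exposed one. The sketch below assumes the JSON shape proposed under Expected Behavior; `next_action`, the doubling policy, and the 4096 MB cap are hypothetical choices made for the example.

```python
def next_action(report: dict, memory_limit_mb: int,
                max_limit_mb: int = 4096) -> tuple[str, int]:
    """Decide how a CI job might react to a sandbox result.

    Returns an (action, memory_limit_mb) pair, where action is one of
    "pass", "retry", or "fail".
    """
    if report.get("exit_code") == 0:
        return ("pass", memory_limit_mb)
    if report.get("oom_killed"):
        doubled = memory_limit_mb * 2
        if doubled <= max_limit_mb:
            # Memory exhaustion: the same command may succeed with more room.
            return ("retry", doubled)
        return ("fail", memory_limit_mb)  # already at the cap
    # Any other failure is treated as a real error and not retried.
    return ("fail", memory_limit_mb)


print(next_action({"exit_code": 137, "oom_killed": True}, 512))
print(next_action({"exit_code": 137, "oom_killed": False}, 512))
```

With only the bare exit code 137 available today, both calls above would be indistinguishable to the wrapper.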

Additional Context

With cgroup v2 (the default on modern Linux), detecting OOM is straightforward:

# Before process start (memory.events also lists other counters such as max and oom)
cat /sys/fs/cgroup/sandbox-12345/memory.events
# oom_kill 0

# After OOM kill
cat /sys/fs/cgroup/sandbox-12345/memory.events
# oom_kill 1

# Peak memory usage (memory.peak requires Linux 5.19 or later)
cat /sys/fs/cgroup/sandbox-12345/memory.peak
# 536870912 (512MB)

Cortex should read these cgroup files after the sandboxed process exits to provide meaningful diagnostics. This is standard practice in container runtimes (Docker, Podman) and process supervisors (systemd).
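
A minimal sketch of that check, written in Python for brevity and not taken from the cortex codebase. It assumes the cgroup v2 `memory.events` format of one `key value` pair per line; `parse_memory_events`, `was_oom_killed`, and `read_cgroup_file` are hypothetical helper names, and the sample strings stand in for the real file contents.

```python
from pathlib import Path


def parse_memory_events(text: str) -> dict[str, int]:
    """Parse 'key value' lines from a cgroup v2 memory.events file."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def was_oom_killed(before: str, after: str) -> bool:
    """True if the oom_kill counter grew between the two snapshots."""
    old = parse_memory_events(before).get("oom_kill", 0)
    new = parse_memory_events(after).get("oom_kill", 0)
    return new > old


def read_cgroup_file(cgroup_dir: str, name: str) -> str:
    """Read one control file, e.g. memory.events or memory.peak."""
    return Path(cgroup_dir, name).read_text()


# Sample snapshots; a real caller would use read_cgroup_file() instead.
before = "low 0\nhigh 0\nmax 0\noom 0\noom_kill 0\n"
after = "low 0\nhigh 0\nmax 3\noom 1\noom_kill 1\n"
print(was_oom_killed(before, after))
```

Comparing two snapshots instead of testing for a nonzero counter matters if the cgroup is reused across runs, since the counters are cumulative for the lifetime of the cgroup.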

The current behavior forces users to become Linux kernel experts just to understand why their sandbox process died, which defeats the purpose of having a user-friendly sandbox abstraction.

Metadata

Assignees

No one assigned

    Labels

    cortex (Issues related to CortexLM/cortex repository), valid (Valid issue)
