[BUG] cortex sandbox Does Not Report OOM Kills Distinctly From Other SIGKILL Causes #2172

### Project

cortex

### Description

When a process running inside `cortex sandbox` is killed by the Linux OOM (Out-of-Memory) killer, cortex reports only a generic exit code and error message, with nothing to indicate that OOM was the cause. This makes memory issues hard to debug, because the user has no sign that memory exhaustion occurred.

### Error Message

```shell
$ cortex sandbox linux -- python3 -c "x = [0] * (10**9)"
Sandbox process exited with code 137

# Exit code 137 = 128 + 9 (SIGKILL), but no indication WHY it was killed
# User has to guess: was it OOM? Was it timeout? Was it manual kill?
```
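The arithmetic in the comment above can be sketched in a few lines. This is only an illustration of the `128 + signal number` shell convention, not cortex code; the function name `describe_exit_code` is hypothetical.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Describe a shell-style exit code.

    By convention, a code above 128 means the process was terminated
    by signal number (code - 128).
    """
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Signal number not known on this platform.
            name = f"signal {signum}"
        return f"terminated by {name}"
    return f"exited with status {code}"


print(describe_exit_code(137))
```

The exit code identifies which signal ended the process, but not who sent it or why, which is exactly the gap this issue describes.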

### Debug Logs

```shell
$ RUST_LOG=debug cortex sandbox linux -- python3 -c "x = [0] * (10**9)"
[DEBUG] Starting sandbox with cgroup memory limit: 512MB
[DEBUG] Spawning process: python3 -c "x = [0] * (10**9)"
[DEBUG] Process started with PID: 12345
[DEBUG] Process received signal: SIGKILL
[DEBUG] Sandbox terminated with exit code: 137
# No log about memory.current, memory.max, or oom_kill events!
```

### System Information

```shell
Bounty Version: 0.1.0
OS: Ubuntu 24.04 LTS
CPU: AMD EPYC-Genoa Processor (8 cores)
RAM: 15 GB
Cgroup: v2 (unified hierarchy)
```

### Screenshots

_No response_

### Steps to Reproduce

1. Start a sandbox with default memory limits:
   ```bash
   cortex sandbox linux -- python3 -c "x = [0] * (10**9)"
   ```

2. Observe the exit code:
   ```
   Sandbox process exited with code 137
   ```

3. Check the kernel log for OOM events (manual detective work):
   ```bash
   sudo dmesg | grep -i oom
   # Shows: python3 invoked oom-killer: gfp_mask=0x...
   ```

4. Note that cortex provided no indication that OOM was the cause

### Expected Behavior

Cortex should detect and report OOM kills explicitly:

1. **Check cgroup OOM events**: Read `memory.events` after process exit to check if `oom_kill` counter increased
2. **Report OOM clearly**: 
   ```
   Sandbox process killed by OOM (used 512MB of 512MB limit)
   Exit code: 137 (OOM killed)
   ```
3. **Include memory stats**: Show peak memory usage vs limit
4. **Suggest remediation**: "Increase sandbox memory limit with --memory 1G"
5. **Structured output**: In `--json` mode, include OOM information:
   ```json
   {
     "exit_code": 137,
     "signal": "SIGKILL",
     "oom_killed": true,
     "memory_limit": "512MB",
     "memory_peak": "512MB"
   }
   ```
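As a rough sketch of how the proposed text and `--json` output could be assembled once the OOM counter, limit, and peak are known: the field names follow the JSON proposed above, while `format_mb`, `build_report`, and `render_text` are hypothetical helpers, and the example is in Python for brevity rather than in cortex's own implementation language.

```python
import json


def format_mb(num_bytes: int) -> str:
    """Render a byte count in the '512MB' style used above (1 MB = 1024 * 1024 bytes here)."""
    return f"{num_bytes // (1024 * 1024)}MB"


def build_report(exit_code: int, oom_kills: int, limit_bytes: int, peak_bytes: int) -> dict:
    """Build the structured result; oom_kills is the increase in the cgroup oom_kill counter."""
    return {
        "exit_code": exit_code,
        # Only the SIGKILL case (128 + 9) is handled in this sketch.
        "signal": "SIGKILL" if exit_code == 137 else None,
        "oom_killed": oom_kills > 0,
        "memory_limit": format_mb(limit_bytes),
        "memory_peak": format_mb(peak_bytes),
    }


def render_text(report: dict) -> str:
    """Render the human-readable message for non-JSON mode."""
    if report["oom_killed"]:
        return (
            f"Sandbox process killed by OOM (used {report['memory_peak']} "
            f"of {report['memory_limit']} limit)\n"
            f"Exit code: {report['exit_code']} (OOM killed)"
        )
    return f"Sandbox process exited with code {report['exit_code']}"


report = build_report(137, 1, 536870912, 536870912)
print(render_text(report))
print(json.dumps(report, indent=2))
```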

### Actual Behavior

When OOM kills the sandboxed process:
1. Cortex only reports the exit code (137)
2. No distinction between OOM kill, user-initiated kill, or other SIGKILL sources
3. No memory usage statistics provided
4. User must manually check `dmesg` or `/var/log/kern.log` to diagnose
5. `--json` output doesn't include any memory-related information

This is particularly problematic because:
- CI/CD systems need to distinguish OOM from other failures for automatic retry/resource-adjustment logic
- Users may waste time debugging code bugs when the issue is just insufficient memory allocation
- Memory leaks are hard to detect without seeing how close to the limit the process got
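To illustrate the CI/CD point, here is a minimal sketch of the retry decision a pipeline could make if the `oom_killed` field proposed in this issue existed. The field does not exist in cortex today, and `next_action` with its return values is purely hypothetical.

```python
import json


def next_action(result_json: str) -> str:
    """Decide what a CI job should do based on the (proposed) sandbox JSON result."""
    result = json.loads(result_json)
    if result.get("exit_code") == 0:
        return "pass"
    if result.get("oom_killed"):
        # Memory exhaustion: retrying with a larger limit may succeed.
        return "retry-with-more-memory"
    # Any other failure, including a SIGKILL of unknown origin.
    return "fail"


print(next_action('{"exit_code": 137, "signal": "SIGKILL", "oom_killed": true}'))
```

Without the field, both cases collapse into the same exit code 137 and the pipeline cannot tell them apart.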

### Additional Context

With cgroup v2 (the default on modern Linux), detecting OOM is straightforward:

```bash
# Before the process starts (memory.events also lists low, high, max, and oom)
cat /sys/fs/cgroup/sandbox-12345/memory.events
# oom_kill 0

# After the OOM kill
cat /sys/fs/cgroup/sandbox-12345/memory.events
# oom_kill 1

# Peak memory usage recorded for the cgroup
cat /sys/fs/cgroup/sandbox-12345/memory.peak
# 536870912 (512MB)
```

Cortex should read these cgroup files after the sandboxed process exits to provide meaningful diagnostics. This is standard practice in container runtimes (Docker, Podman) and process supervisors (systemd).
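The before/after comparison can be sketched as follows. This is a minimal illustration, assuming the sandbox cgroup directory is still readable after the process exits; the helper names and the sample file contents are made up for the example, and it is written in Python for brevity rather than in cortex's own implementation language.

```python
from pathlib import Path


def parse_memory_events(text: str) -> dict[str, int]:
    """Parse cgroup v2 memory.events content: one 'key value' pair per line."""
    events = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def read_oom_kill_count(cgroup_dir: Path) -> int:
    """Read the oom_kill counter from <cgroup_dir>/memory.events, or 0 if it cannot be read."""
    try:
        text = (cgroup_dir / "memory.events").read_text()
    except OSError:
        return 0
    return parse_memory_events(text).get("oom_kill", 0)


def was_oom_killed(before: int, after: int) -> bool:
    """The process was OOM-killed if the counter increased while it ran."""
    return after > before


# Illustrative file contents, not captured from a real run.
sample_before = "low 0\nhigh 0\nmax 0\noom 0\noom_kill 0\n"
sample_after = "low 0\nhigh 0\nmax 14\noom 1\noom_kill 1\n"

before = parse_memory_events(sample_before)["oom_kill"]
after = parse_memory_events(sample_after)["oom_kill"]
print(was_oom_killed(before, after))
```

Comparing the counter before and after the run, rather than checking for a non-zero value, avoids misattributing an earlier OOM kill in a reused cgroup to the current process.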

The current behavior forces users to become Linux kernel experts just to understand why their sandbox process died, which defeats the purpose of having a user-friendly sandbox abstraction.

