| 1 | +# investigate: Systematic Root Cause Debugging |
| 2 | + |
| 3 | +Structured debugging methodology that finds root causes before applying fixes. Use when |
| 4 | +a bug is reported, a test fails unexpectedly, or something "just stopped working." |
| 5 | + |
| 6 | +**Iron Law: No fixes without root cause investigation first.** |
| 7 | + |
| 8 | +## When to Use |
| 9 | + |
| 10 | +- Bug reports from users or QA |
| 11 | +- Test failures you don't immediately understand |
| 12 | +- "It was working yesterday" situations |
| 13 | +- Production errors or crashes |
| 14 | +- Performance regressions |
| 15 | + |
| 16 | +## Phase 1: Gather & Reproduce |
| 17 | + |
| 18 | +Before touching any code, understand the problem: |
| 19 | + |
| 20 | +1. **Collect symptoms** — What exactly is failing? Error messages, stack traces, screenshots, user reports. |
| 21 | +2. **Reproduce the issue** — Can you trigger it reliably? What are the exact steps? |
| 22 | +3. **Check recent changes** — Run `git log --oneline -20` and `git diff HEAD~5` to see whether something changed recently. |
| 23 | +4. **Narrow the scope** — Is it one endpoint, one page, one function? Or widespread? |
| 24 | + |
| 25 | +If you cannot reproduce after 3 attempts, stop and ask the user for more context. |
| 26 | + |
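The 3-attempt cap on reproduction can be sketched as a small helper. This is a minimal illustration, not part of the skill itself: `try_reproduce` and `make_flaky` are hypothetical names, and the `repro` callable stands in for whatever manual or scripted steps trigger the bug.

```python
from typing import Callable, Optional

MAX_ATTEMPTS = 3  # the cap stated in Phase 1


def try_reproduce(repro: Callable[[], bool],
                  max_attempts: int = MAX_ATTEMPTS) -> Optional[int]:
    """Run `repro` until it reports the bug.

    Returns the 1-based attempt number on which the bug appeared,
    or None if it never appeared (time to ask for more context).
    """
    for attempt in range(1, max_attempts + 1):
        if repro():
            return attempt
    return None


def make_flaky(fails_on: int) -> Callable[[], bool]:
    """Hypothetical stand-in for a bug that shows up on the Nth run."""
    calls = {"n": 0}

    def repro() -> bool:
        calls["n"] += 1
        return calls["n"] == fails_on

    return repro
```

A `None` result corresponds to the "stop and ask the user" branch; any integer result means the issue is reproducible and the investigation can move to Phase 2.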
| 27 | +## Phase 2: Analyze |
| 28 | + |
| 29 | +Match the symptoms against known patterns: |
| 30 | + |
| 31 | +| Pattern | Indicators | |
| 32 | +|---------|------------| |
| 33 | +| Race condition | Intermittent, timing-dependent, works in debugger | |
| 34 | +| Null/undefined propagation | TypeError, "cannot read property of null/undefined" | |
| 35 | +| State corruption | Works on first load, fails on subsequent interactions | |
| 36 | +| Data mismatch | Works with some data, fails with other data | |
| 37 | +| Environment issue | Works locally, fails in CI/staging/prod | |
| 38 | +| Dependency change | Worked before package update, lockfile changed | |
| 39 | +| Migration issue | DB-related errors after schema change | |
| 40 | +| Cache staleness | Works after hard refresh or cache clear | |
| 41 | +| Auth/session issue | Works when freshly logged in, fails later | |
| 42 | +| Concurrency issue | Works with one user, fails under load | |
| 43 | + |
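As a rough sketch, the table above can be treated as a keyword lookup that ranks candidate patterns for a symptom description. The keyword lists are assumptions paraphrased from the Indicators column, and `match_patterns` is a hypothetical helper. A match is only a starting point for a hypothesis, not a diagnosis.

```python
from typing import Dict, List

# Keywords are paraphrased from the Indicators column (assumed, not exhaustive).
PATTERNS: Dict[str, List[str]] = {
    "Race condition": ["intermittent", "timing", "works in debugger"],
    "Null/undefined propagation": ["typeerror", "cannot read property"],
    "State corruption": ["works on first load", "fails on subsequent"],
    "Data mismatch": ["some data", "other data"],
    "Environment issue": ["works locally", "fails in ci", "staging"],
    "Dependency change": ["package update", "lockfile"],
    "Migration issue": ["schema change", "migration"],
    "Cache staleness": ["hard refresh", "cache clear"],
    "Auth/session issue": ["freshly logged in", "session"],
    "Concurrency issue": ["one user", "under load"],
}


def match_patterns(symptoms: str) -> List[str]:
    """Return pattern names whose keywords appear in `symptoms`, best match first."""
    text = symptoms.lower()
    scored = [
        (sum(keyword in text for keyword in keywords), name)
        for name, keywords in PATTERNS.items()
    ]
    # sorted() is stable, so ties keep the table's original order.
    ranked = sorted(scored, key=lambda pair: -pair[0])
    return [name for score, name in ranked if score > 0]
```

An empty result means none of the listed patterns fit, which is itself useful information when forming the first hypothesis.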
| 44 | +## Phase 3: Hypothesize & Test |
| 45 | + |
| 46 | +1. **Form a hypothesis** — "I think X is happening because Y" |
| 47 | +2. **Design a test** — How can you prove or disprove this? Add targeted logging, write a minimal reproduction, check specific state. |
| 48 | +3. **Test the hypothesis** — Run the test. Does it confirm or refute? |
| 49 | +4. **If refuted** — Form a new hypothesis. Do NOT fix something that isn't the root cause. |
| 50 | +5. **3-strike rule** — If 3 hypotheses fail, stop and escalate. Share what you've tried. |
| 51 | + |
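The hypothesis loop and its 3-strike rule can be sketched as a small log that reports the next action. `HypothesisLog` and the action strings are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass, field
from typing import List

MAX_STRIKES = 3  # the 3-strike rule from step 5


@dataclass
class HypothesisLog:
    """Tracks tested hypotheses and decides what to do next."""
    refuted: List[str] = field(default_factory=list)
    confirmed: str = ""

    def record(self, hypothesis: str, confirmed: bool) -> str:
        """Record one tested hypothesis and return the next action."""
        if confirmed:
            self.confirmed = hypothesis
            return "fix"
        self.refuted.append(hypothesis)
        if len(self.refuted) >= MAX_STRIKES:
            return "escalate"  # share self.refuted with whoever picks it up
        return "new hypothesis"
```

Keeping the refuted hypotheses in a list means the escalation message already contains "what you've tried."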
| 52 | +### Sanitize Before Searching |
| 53 | + |
| 54 | +When searching for errors online or in the codebase: |
| 55 | +- Strip specific values (IDs, paths, timestamps) |
| 56 | +- Keep the error structure and type |
| 57 | +- Example: `TypeError: Cannot read property 'id' of undefined at UserService.getUser` → search for `TypeError: Cannot read property of undefined UserService` |
| 58 | + |
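One way to automate the stripping step is a short list of regular expressions. This is a sketch under assumptions: the rules below (ISO-like timestamps, multi-segment file paths, quoted values, numeric IDs) are illustrative and will not cover every log format, and `sanitize_error` is a hypothetical name. It is more conservative than the hand-shortened example above, so it keeps words such as `at` and the method name.

```python
import re

# Order matters: timestamps and paths are removed before bare numbers.
_RULES = [
    re.compile(r"\d{4}-\d{2}-\d{2}[T ]\d{2}:\d{2}:\d{2}(?:\.\d+)?Z?"),  # timestamps
    re.compile(r"(?:[A-Za-z]:)?(?:[/\\][\w.\-]+){2,}(?::\d+)*"),        # file paths
    re.compile(r"'[^']*'"),                                              # 'quoted' values
    re.compile(r'"[^"]*"'),                                              # "quoted" values
    re.compile(r"\b0x[0-9a-fA-F]+\b|\b\d+\b"),                           # hex / decimal IDs
]


def sanitize_error(message: str) -> str:
    """Strip instance-specific values but keep the error type and structure."""
    for pattern in _RULES:
        message = pattern.sub("", message)
    # Collapse the whitespace left behind by removed values.
    return " ".join(message.split())
```

Applied to the example above, this yields `TypeError: Cannot read property of undefined at UserService.getUser`, which can then be trimmed further by hand.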
| 59 | +## Phase 4: Fix |
| 60 | + |
| 61 | +Only after root cause is confirmed: |
| 62 | + |
| 63 | +1. **Fix the root cause, not the symptom** — If a null value crashes downstream, fix where null is introduced, not where it crashes. |
| 64 | +2. **Minimal diff** — Change only what's necessary. Don't refactor while fixing. |
| 65 | +3. **Write a regression test** — A test that fails before the fix and passes after it. |
| 66 | +4. **Verify the fix** — Run the full test suite. Manually reproduce the original steps and confirm the bug is gone. |
| 67 | +5. **Check blast radius** — Does this fix affect other code paths? Run `git diff --stat`; if more than 5 files changed, flag it. |
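Steps 1 and 3 can be illustrated with a small sketch. Everything here is hypothetical (`USERS`, `find_user`, `greeting`, `UserNotFound`). It assumes the original bug was a lookup that silently returned `None`, which then crashed later in `greeting`. The fix raises where the missing value originates, and the regression test pins that behavior.

```python
from typing import Dict

# Hypothetical in-memory data standing in for a real data source.
USERS: Dict[int, Dict[str, str]] = {1: {"name": "Ada"}}


class UserNotFound(LookupError):
    """Raised where the missing value originates, not where it is used."""


def find_user(user_id: int) -> Dict[str, str]:
    # Root-cause fix: previously (by assumption) this returned None,
    # and the crash surfaced downstream in greeting().
    user = USERS.get(user_id)
    if user is None:
        raise UserNotFound(f"no user with id {user_id}")
    return user


def greeting(user_id: int) -> str:
    # No defensive None check needed here; find_user never returns None.
    return f"Hello, {find_user(user_id)['name']}"


def test_missing_user_is_reported_at_the_source() -> None:
    """Regression test: fails on the old behavior, passes after the fix."""
    try:
        greeting(99)
    except UserNotFound:
        return
    raise AssertionError("expected UserNotFound")
```

Patching `greeting` with an `if user is None` guard would have hidden the symptom while leaving every other caller of `find_user` exposed.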
| 68 | + |
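The blast-radius check in step 5 can be scripted. This sketch uses `git diff --name-only` (one path per line) rather than `--stat` because it is easier to parse. The helper names are hypothetical, and `git_changed_files` assumes it runs inside a Git working tree.

```python
import subprocess
from typing import List

MAX_FILES = 5  # threshold from step 5


def changed_files(name_only_output: str) -> List[str]:
    """Parse the output of `git diff --name-only` into a list of paths."""
    return [line.strip() for line in name_only_output.splitlines() if line.strip()]


def needs_flag(files: List[str], limit: int = MAX_FILES) -> bool:
    """True when the fix touches more files than the limit allows."""
    return len(files) > limit


def git_changed_files() -> List[str]:
    """Ask Git for the changed paths (requires a Git working tree)."""
    result = subprocess.run(
        ["git", "diff", "--name-only"],
        capture_output=True, text=True, check=True,
    )
    return changed_files(result.stdout)
```

Calling `needs_flag(git_changed_files())` before committing gives a yes/no answer to "should this be discussed first?"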
| 69 | +## Phase 5: Report |
| 70 | + |
| 71 | +After fixing, write a brief debug report: |
| 72 | + |
| 73 | +``` |
| 74 | +## Debug Report |
| 75 | + |
| 76 | +**Issue:** [one-line description] |
| 77 | +**Root cause:** [what was actually wrong] |
| 78 | +**Fix:** [what was changed and why] |
| 79 | +**Regression test:** [test file:line that prevents recurrence] |
| 80 | +**Blast radius:** [what else might be affected] |
| 81 | +**Time spent:** [how long the investigation took] |
| 82 | +``` |
| 83 | + |
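If reports are generated by a script, the template above maps naturally onto a small data class. This is an illustrative sketch. `DebugReport` is a hypothetical name and the field names simply mirror the template's labels.

```python
from dataclasses import dataclass


@dataclass
class DebugReport:
    """Fields mirror the labels in the report template."""
    issue: str
    root_cause: str
    fix: str
    regression_test: str
    blast_radius: str
    time_spent: str

    def render(self) -> str:
        """Return the report as Markdown in the template's layout."""
        return "\n".join([
            "## Debug Report",
            "",
            f"**Issue:** {self.issue}",
            f"**Root cause:** {self.root_cause}",
            f"**Fix:** {self.fix}",
            f"**Regression test:** {self.regression_test}",
            f"**Blast radius:** {self.blast_radius}",
            f"**Time spent:** {self.time_spent}",
        ])
```

Because every field is required, an incomplete report fails at construction time instead of shipping with a blank section.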
| 84 | +## Important Rules |
| 85 | + |
| 86 | +1. **Never apply unverified fixes.** "Maybe this will work" is not a fix — it's a guess. Verify first. |
| 87 | +2. **Read before writing.** Understand the code path before changing it. |
| 88 | +3. **One fix at a time.** Don't combine multiple fixes — you won't know which one worked. |
| 89 | +4. **Escalate early.** After 3 failed hypotheses, stop. Share findings and ask for help. |
| 90 | +5. **Flag large blast radius.** If a fix touches >5 files, pause and discuss with the user. |
| 91 | +6. **Don't optimize while debugging.** Fix the bug. Optimization is a separate task. |
| 92 | +7. **Check the obvious first.** Typos, wrong variable names, missing imports, incorrect config. |
| 93 | +8. **Trust error messages.** Read them carefully. They usually tell you exactly what's wrong. |
| 94 | +9. **Git blame is your friend.** When did this code change? Who changed it? What was the commit message? |
| 95 | +10. **Environment matters.** Check env vars, config files, database state, API versions. |