Skip to content

Fix buffer overrun in livenessTracker.cpp#555

Merged
gh-worker-dd-mergequeue-cf854d[bot] merged 2 commits into
mainfrom
zgu/buffer-overrun
May 28, 2026
Merged

Fix buffer overrun in livenessTracker.cpp#555
gh-worker-dd-mergequeue-cf854d[bot] merged 2 commits into
mainfrom
zgu/buffer-overrun

Conversation

@zhengyu123
Copy link
Copy Markdown
Contributor

@zhengyu123 zhengyu123 commented May 28, 2026

What does this PR do?:
The fix addresses a critical bug in ddprof-lib/src/main/cpp/livenessTracker.cpp:354 where _table_cap was being updated before verifying that the realloc() call succeeded.

Motivation:
Fix possible memory corruption.

Additional Notes:
The original code assigned _table_cap = newcap inside the realloc() call itself:
TrackingEntry *tmp = (TrackingEntry *)realloc(
_table, sizeof(TrackingEntry) * (_table_cap = newcap));

If realloc() failed and returned nullptr, _table_cap would still be updated to the larger value, creating a mismatch between the recorded capacity and the actual allocated memory size. Subsequent operations would then write beyond the actual buffer bounds, causing a buffer overrun.

How to test the change?:

  • CI tests
  • New gtest

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.
  • JIRA: PROF-14816

Unsure? Have a question? Request a review!

zhengyu123 and others added 2 commits May 28, 2026 16:55
Adds comprehensive test coverage for the fix in commit 72c0e14 that
prevents _table_cap from being updated when realloc fails. The test
validates that capacity is only updated after successful reallocation,
preventing buffer overrun vulnerabilities.

Test includes 5 cases covering correct resize behavior, the buggy
pattern that was fixed, and edge cases with capacity limits.

Co-Authored-By: Claude Sonnet 4.5 <noreply@anthropic.com>
@datadog-prod-us1-3
Copy link
Copy Markdown

datadog-prod-us1-3 Bot commented May 28, 2026

Pipelines

Fix all issues with BitsAI

⚠️ Warnings

🚦 3 Pipeline jobs failed

DataDog/java-profiler | gtest-asan-amd64   View in Datadog   GitLab

🔧 Fix in code (Fix with Cursor). Memory leak detected: 295720 byte(s) leaked in 3 allocation(s) during tests in thread_teardown_safety_ut.

DataDog/java-profiler | gtest-tsan-amd64   View in Datadog   GitLab

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. User "system:serviceaccount:gitlab-runner:gitlab-runner-shared" cannot create resource "pods" in the namespace "default".

DataDog/java-profiler | gtest-tsan-arm64   View in Datadog   GitLab

🛟 This job is unlikely to succeed on retry. Please review your pipeline configuration. Job failed due to insufficient permissions to create pods in the 'default' namespace.

Useful? React with 👍 / 👎

This comment will be updated automatically if new data arrives.
🔗 Commit SHA: 516fef2 | Docs | Datadog PR Page | Give us feedback!

@dd-octo-sts
Copy link
Copy Markdown
Contributor

dd-octo-sts Bot commented May 28, 2026

CI Test Results

Run: #26600478483 | Commit: c712e11 | Duration: 12m 21s (longest job)

1 of 32 test jobs failed

Status Overview

JDK glibc-aarch64/debug glibc-amd64/debug musl-aarch64/debug musl-amd64/debug
8 - - -
8-ibm - - -
8-j9 - -
8-librca - -
8-orcl - - -
11 - - -
11-j9 - -
11-librca - -
17 - -
17-graal - -
17-j9 - -
17-librca - -
21 - -
21-graal - -
21-librca - -
25 - -
25-graal - -
25-librca - -

Legend: ✅ passed | ❌ failed | ⚪ skipped | 🚫 cancelled

Failed Tests

glibc-amd64/debug / 25-graal

Job: View logs

No detailed failure information available. Check the job logs.

Summary: Total: 32 | Passed: 31 | Failed: 1


Updated: 2026-05-28 20:47:31 UTC

@zhengyu123 zhengyu123 marked this pull request as ready for review May 28, 2026 19:29
@zhengyu123 zhengyu123 requested a review from a team as a code owner May 28, 2026 19:29
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 516fef26b8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment on lines 361 to 364
} else {
Log::debug("Cannot add sampled object to Liveness tracking table, "
"resize attempt failed, the table is overflowing");
Log::debug("Cannot add sampled object to Liveness tracking table, "
"resize attempt failed, the table is overflowing");
}
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Delete the weak ref after a failed resize retry

When realloc returns null here and cleanup_table(true) did not free a slot, the subsequent goto retry re-enters with the unchanged _table_cap; the second insertion attempt falls through with idx == _table_cap and never stores or deletes the jweak created at the start of track. Under sustained full-table/OOM conditions this leaks one weak global reference per sampled object, so the failed-resize path should delete ref when the retry cannot insert it.

Useful? React with 👍 / 👎.

Comment on lines +127 to +129
bool success = mock.resizeTableBuggy(20);
EXPECT_TRUE(success);
EXPECT_EQ(mock.getCapacity(), 20);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Make the regression test exercise realloc failure

This test only calls resizeTableBuggy(20) on a normal small allocation, so the pre-fix pattern succeeds and the test would still pass; the rest of the failure scenario is comments with no assertion or mocked failed realloc. Because AGENTS.md requires bug-fix tests to fail before the fix, this should force a realloc failure and assert that capacity remains unchanged, otherwise the test cannot catch a future reintroduction of _table_cap = newcap before checking tmp.

Useful? React with 👍 / 👎.

@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot merged commit c712e11 into main May 28, 2026
180 of 186 checks passed
@gh-worker-dd-mergequeue-cf854d gh-worker-dd-mergequeue-cf854d Bot deleted the zgu/buffer-overrun branch May 28, 2026 20:31
@github-actions github-actions Bot added this to the 1.44.0 milestone May 28, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants