Implement SCA Reachability runtime detection: report vulnerable classes and callsites via telemetry#11352
Implement SCA Reachability runtime detection: report vulnerable classes and callsites via telemetry#11352jandro996 wants to merge 72 commits into
Conversation
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: e607887e99
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 82ea8065d9
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
|
Codex Review: Didn't find any major issues. 👍 ℹ️ About Codex in GitHubYour team has set up Codex to review pull requests in this repo. Reviews are triggered when you
If Codex has suggestions, it will comment; otherwise it will react with 👍. Codex can also answer questions or update the PR. Try commenting "@codex address that feedback". |
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 6008ac9ca0
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 750b3c3877
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
|
@codex revew |
|
I reviewed the PR context and the currently checked-out code for the flagged areas. I don’t see any new blocking issues beyond what has already been discussed/resolved. Review result
Commands used
No code changes were made, so no commit or follow-up PR was created. |
|
@codex review |
1 similar comment
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 24c53f05cc
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
This comment has been minimized.
This comment has been minimized.
|
@codex review |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7b449cff45
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…ame, fix recordHit TOCTOU - Remove unused markPending() - the retry-via-re-mark flow was superseded by the emit-immediately approach in Step 3 of ScaReachabilityPeriodicAction - Inline private single-statement scheduleRetransformByName() at its only call site - Fix non-atomic containsKey+get in recordHit: use computeIfAbsent directly, which eliminates the race and avoids the unintended pendingReport=true side effect that unconditional registerCve() would have introduced - Update stale Javadoc in ScaReachabilityPeriodicAction referencing the old re-mark-pending retry strategy
…d (Option C)
When DD_APPSEC_SCA_ENABLED=true, ScaReachabilityPeriodicAction handles all
app-dependencies-loaded emission by merging DependencyService drains with CVE
registry state. DependencyPeriodicAction is skipped to avoid:
- Duplicate entries per dep per heartbeat
- DependencyService queue being drained before ScaReachabilityPeriodicAction
can see new JARs (which prevented knownDeps from being populated and caused
CVE entries to always emit without source/hash)
The previous attempt (commit b0421ab) failed system-tests due to a
markPending retry in Step 3 that could delay CVE emission by one heartbeat
cycle. That was fixed in 6925358 (immediate emit). The current code is safe.
…pection, StandardizedLogging
… symbols for same class An entry with symbols [Yaml.load, Yaml.loadAll] indexed once per symbol causing the same ScaEntry to appear twice for org/yaml/snakeyaml/Yaml. processClass then iterated the duplicate and injected bytecode callbacks twice per method, resulting in redundant ScaReachabilityCallback.onMethodHit() calls on every invocation. Fix: track seen class names per entry with a HashSet; add regression test.
…classloader scenarios ScaReachabilityCallback.reported and ScaReachabilityTransformer.reportedHits keyed hits without the artifact version. In a multi-classloader setup (multiple WARs, OSGi) where two versions of the same library coexist, the first version's hit suppressed all subsequent versions for the same CVE and symbol. Both keys now include version: vulnId|artifact|version|class|method.
When the call chain is client -> lib_wrapper -> vulnerable_method, findCallsite was reporting lib_wrapper instead of the client frame. Add sca_stack_exclusion.trie (generated as ScaStackExclusionTrie) and apply it after the vulnerable-class boundary to skip known library packages, matching the approach used by IAST SinkModuleBase. Refactor findCallsite to a testable overload that accepts an explicit StackTraceElement[], and rewrite tests with synthetic stacks.
Document that replace('.', '/') would produce wrong internal names for
inner classes if GHSA uses dot notation (e.g. Outer.Inner instead of
Outer$Inner). Current database has no inner class entries so safe for
now; track for when database team defines method-symbol format.
Addresses Manuel's review comment: all heavyweight processing now runs on the telemetry heartbeat thread instead of inline during class loading. Changes in ScaReachabilityTransformer: - transform() on first load (classBeingRedefined == null) only enqueues a PendingClass(className, jarURL) record and returns null immediately. No JAR I/O (DependencyResolver.resolve), no version lookup, no hit reporting on the class-loading thread. Matches the pattern used by LocationsCollectingTransformer. - New processPendingClassEvents(): drains the queue on the telemetry heartbeat, performs all heavyweight work (JAR reads, version resolution, class-level hit reporting), and populates pendingRetransformNames for method-level symbols. Must run before performPendingRetransforms() so classes queued in one heartbeat are retransformed in the same cycle. - transform() on retransform (classBeingRedefined != null) takes the existing inline path (processClass 4-arg) to inject method-level bytecode - required by the ClassFileTransformer contract. - Tradeoff vs old inline approach: ~10s window at startup during which method calls go undetected. No behavioral difference for class-level (the only active symbols today). - performPendingRetransforms(): use contains()+removeAll() instead of remove() inside the getAllLoadedClasses() loop. Spring Boot fat JARs create multiple LaunchedURLClassLoader instances; the old remove() only matched the first instance, so the application classloader was never retransformed and method-level callbacks never fired. All classloader instances of the same class name now get the callback bytecode injected. Additional cleanups in the same files: - Rename void processClass(3-arg) to reportClassLevelHits to disambiguate it from the byte[]-returning processClass(4-arg) that injects bytecode. - Extract duplicated method-level symbol stream expression into private static helper hasMethodLevelSymbolForClass(List<ScaEntry>, String). - ScaReachabilitySystem.start() periodic work callback updated to call processPendingClassEvents() before performPendingRetransforms(). - ASM injection: replace SIPUSH with visitLdcInsn(line) so line numbers > 32767 (valid in generated code) produce correct bytecode instead of a truncated signed-short value. - Defer dotClassName allocation to the method-level symbol path where it is actually needed. - Update processClass(4-arg) Javadoc to reflect two-phase design: only called on retransform for method-level symbols; class-level hits were already reported by reportClassLevelHits in processPendingClassEvents. - Two new tests in ScaReachabilityMethodLevelTest verify the structural invariants: first load enqueues and returns null; retransform does not re-enqueue.
ScaReachabilityRetransformTest verifies that performPendingRetransforms() retransforms ALL classloader instances of a class, not just the first one. Simulates Spring Boot's multiple LaunchedURLClassLoader instances by returning the same Class<?> twice from getAllLoadedClasses() — with the old remove() approach only one entry was passed to retransformClasses(); with the fix (contains()+removeAll()) both are collected. Also tests the re-queue path: on retransformClasses() failure, all collected classes must be added back to pendingRetransform for the next heartbeat retry. Adds mockito to appsec test dependencies. Makes pendingRetransformNames package-private (consistent with pendingRetransform and pendingClassEvents).
InternalError and other JVM Errors from retransformClasses() would escape a catch(Exception) block and kill the telemetry thread silently. Matches the existing catch(Throwable) in transform() for the same reason.
…buildSrc Move the inline generateScaCvesJson task from appsec/build.gradle into a proper Gradle plugin (ScaEnrichmentsPlugin) registered as 'dd-trace-java.sca-enrichments'. The plugin is simpler to test and removes ~90 lines of scripting from the build script. The plugin and committed sca_cves.json are a temporary solution — the symbol database will be delivered via Remote Config in a future iteration, at which point both will be removed.
Use pluginManager.withPlugin("java") to defer the processResources
dependency until after the java plugin has registered the task, avoiding
a configuration-time failure when the plugin is applied before java.
Add 4 integration tests using GradleFixture/TestKit:
- task is SKIPPED when file exists and -PrefreshSca is absent
- task attempts to run (not SKIPPED) when -PrefreshSca is set
- task attempts to run (not SKIPPED) when file is absent
- processResources depends on generateScaCvesJson
- Replace FQN types with proper imports (Set, ConcurrentHashMap in
ScaReachabilityCallback; Reader in ScaCveDatabase; Map.Entry in
ScaReachabilityDependencyRegistry)
- Narrow catch (Throwable) to catch (Exception) in onMethodHit: OOMError
and StackOverflowError are JVM conditions that must propagate to the
application; catch (Throwable) in performPendingRetransforms stays
because retransformClasses() can throw InternalError by spec
- Use Strings.getClassName() instead of manual replace('/', '.')
- Guard getAllLoadedClasses() loops against null items (class unloading race)
…Transformer datadog.trace.api.internal.VisibleForTesting is available transitively via components:annotations (already on classpath via internal-api). Replace package-private comments with the annotation.
resolveVersionForArtifact and injectMethodCallbacks were still using the comment form; align with the annotation applied to the other testing-only members.
…llsite Replace Thread.currentThread().getStackTrace() with StackWalkerFactory.INSTANCE.walk() so that on JDK9+ the stack is evaluated lazily: frames are walked one by one and evaluation stops as soon as the application callsite is found, without materializing the full stack array. Agent frame filtering is now handled by the walker itself (AbstractStackWalker.doFilterStack) rather than manually in the loop. The StackTraceElement[] overload used in tests is preserved; it delegates to the shared findCallsiteInStream helper with the agent filter applied manually on the array.
1b6a3b3 to
18d9540
Compare
|
Thanks @jpbempel for all the review comments! all applied (FQN cleanup, catch scope, VisibleForTesting, Strings.getClassName, null guards on getAllLoadedClasses, and StackWalkerFactory) Thanks @sarahchen6 for porting the SCA benchmark variant to apm-sdks-benchmarks and for the heads up on the migration (updated the PR description accordingly) Sorry for the delay! :( |
sarahchen6
left a comment
There was a problem hiding this comment.
Looks good from LP side!
|
|
||
| private static String readAll(Reader reader) throws IOException { | ||
| StringBuilder sb = new StringBuilder(); | ||
| char[] buf = new char[8192]; |
There was a problem hiding this comment.
nitpick: Use a constant for the buffer size
| new ScaReachabilityDependencyRegistry(); | ||
|
|
||
| /** Keyed by {@link #depKey(String, String)}. */ | ||
| private final ConcurrentHashMap<String, DependencyState> dependencies = new ConcurrentHashMap<>(); |
There was a problem hiding this comment.
Does it make sense to make this tuneable ?
Also, it might be useful to log when adding while map is at capacity.
| ClassLoader cl = Thread.currentThread().getContextClassLoader(); | ||
| while (cl != null) { |
There was a problem hiding this comment.
question: I I read the code right, the current thread is the telemetry thread ?
The transformer is added via the ScaReachabilitySystem
// processPendingClassEvents drains the first-load queue (JAR resolution + hit reporting);
// performPendingRetransforms then injects method-level callbacks for any classes it queued.
// Order matters: process events first so retransforms happen in the same heartbeat.
ScaReachabilityDependencyRegistry.INSTANCE.setPeriodicWorkCallback(
() -> {
transformer.processPendingClassEvents();
transformer.performPendingRetransforms();
});Then called in the ScaReachabilityPeriodicAction
@Override
public void doIteration(TelemetryService telService) {
// Trigger pending retransformations (method-level symbols on already-loaded classes, or
// classes where JAR version resolution failed at load time and needs a retry).
Runnable work = ScaReachabilityDependencyRegistry.INSTANCE.getPeriodicWorkCallback();
if (work != null) {
work.run();
}But at the periodic action, is run from the telemetry thread
However, the AgentThreadFactory sets the context classloader to null.
If my read is correct, the code never enters the while loop, and as such I think it might miss any app that has a custom classloader, e.g. osgi, multi tenant apps, etc. (I don't think java.class.path covers these deps)
| String version = findArtifactVersionInClasspath(artifactName); | ||
| if (version != null) { | ||
| classpathArtifactCache.put(artifactName, version); // only cache hits; misses are retried | ||
| } |
There was a problem hiding this comment.
question: I believe different versions of the artifact can shadow one another if both exist on the JVM in two distincgt isolated classloaders, Like with OSGi, etc.
Unsure how it's possible to achieve at this time but I'd rather let you know, since it's done on the reported hit map.
| // Computed lazily: only needed for method-level symbol injection. | ||
| String dotClassName = null; | ||
|
|
||
| for (ScaEntry entry : entries) { |
There was a problem hiding this comment.
question: This looks like this lookup can become a bottleneck for startup, e.g. if there are a lot of entries, this requires to lookup the class and walk again the entries list each time. I think it can be optimized, but I'm not sure yet, maybe with our trie, rather than doing in a loop equals.
| if (!Config.get().isAppSecScaEnabled()) { | ||
| return; | ||
| } |
There was a problem hiding this comment.
suggestion: Should this be disabled as well if telemetry is disabled ? DD_INSTRUMENTATION_TELEMETRY_ENABLED=false, and maybe also for DD_TELEMETRY_DEPENDENCY_COLLECTION_ENABLED (there's null check on dependencyService in TelemetrySystem).
| @@ -75,7 +76,14 @@ static Thread createTelemetryRunnable( | |||
| } | |||
| } | |||
| if (null != dependencyService) { | |||
There was a problem hiding this comment.
| "sca_cves.json is committed to the repo so CI does not need network access." | ||
| group = "build" | ||
| outputs.file(outputFile) | ||
| onlyIf { project.hasProperty("refreshSca") || !outputFile.exists() } |
There was a problem hiding this comment.
issue: I don't believe that works, the task up-to-date-ness can still make the task not refreshing forcibly the file, since it's based in task's inputs.
Instead, if you need to forcibly refresh depending only on the property, you. can use upToDateWhen API.
| onlyIf { project.hasProperty("refreshSca") || !outputFile.exists() } | |
| outputs.upToDateWhen { !project.hasProperty("refreshSca") } |
| val outputFile = project.file("src/main/resources/sca_cves.json") | ||
|
|
||
| val generateTask = | ||
| project.tasks.register("generateScaCvesJson") { |
There was a problem hiding this comment.
note: THis is fine for a temporary plugin, but if it's a longer term approach, I suggest to make a concrete task type.
|
|
||
| logger.lifecycle("Fetching GHSA enrichment index from GitHub...") | ||
| @Suppress("UNCHECKED_CAST") | ||
| val fileList = githubFetch(SCA_ENRICHMENTS_API, token) as List<Map<String, Any>> |
There was a problem hiding this comment.
suggestion: I'd suggest the add the ability to pass a custom URL, e.g. via a gradle property, that could be handy in CI.
Also, what happens if for some reason the content cannot be grabbed, 404, network error ?
Is it acceptable to not ship this file, otherwise that might be something to strengthen.
Maybe it's better to get these files via a git clone from within the CI ?
What Does This Do
Implements the SCA Reachability subsystem for the Java APM agent. When
DD_APPSEC_SCA_ENABLED=true, the agent detects at runtime which classes and methods from vulnerable library versions are actually loaded and invoked, and reports this viaapp-dependencies-loadedtelemetry using the RFC stateful heartbeat model.Build - Symbol database
generateScaCvesJsonindd-java-agent/appsec/build.gradledownloads GHSA enrichments from the privatesca-reachability-databaseand bundlessca_cves.jsonin the agent JAR. The committed copy is used in CI (no network access required at build time). Maintainers regenerate with./gradlew generateScaCvesJson -PrefreshSca.GhsaEnrichmentParser(buildSrc Kotlin): converts the GHSA enrichment format to the internal format - one record per Maven artifact, class symbols in JVM internal format (slashes). Filters to JVM language and Maven ecosystem only.Detection - ClassFileTransformer (two-phase design)
ScaReachabilityTransformeruses a two-phase approach to keep JAR I/O off the class-loading thread:classBeingRedefined == null):transform()only enqueues aPendingClass(className, jarURL)record and returnsnullimmediately. No I/O on the class-loading thread. Matches the pattern used byLocationsCollectingTransformer.processPendingClassEvents()): runs on the telemetry heartbeat thread; drains the queue, resolves artifact versions from JARpom.properties, evaluates version ranges, reports class-level hits, and schedules method-level retransformation.classBeingRedefined != null): injects ASM bytecode callbacks at vulnerable method entry points. Tradeoff: method calls during the ~10s startup window before the first heartbeat are not captured; there is no behavioral difference for class-level symbols (the only active symbols today).Additional transformer details:
ScaReachabilitySystem: entry point called fromAgent.javaby reflection (same pattern asAppSecSystem).ScaReachabilityCallback(bootstrap): dedup + dispatch bridge between application classloader (injected bytecode) and agent classloader (handler). Must live in bootstrap to be visible from any classloader.pom.propertiesfirst, then classpath scan as fallback for aggregator artifacts (e.g. Spring Boot starters). Java 9+ support viajava.class.pathfallback.VersionRangeParser: evaluates GHSA version range conditions (<,<=,>=,>,=, comma-separated AND).performPendingRetransforms(): usescontains()+removeAll()instead ofremove()inside thegetAllLoadedClasses()loop so that all classloader instances of the same class are retransformed (Spring Boot creates multipleLaunchedURLClassLoaderinstances).Callsite detection
ScaReachabilitySystem.findCallsite()walks the thread stack to report the application frame that called the vulnerable method, not the vulnerable method itself:StackWalkerFactory.INSTANCE.walk()is used for lazy stack evaluation on JDK9+ (avoids materializing the full stack array). Agent frames are pre-filtered by the walker.ScaStackExclusionTrie(generated fromsca_stack_exclusion.trieviatries.gradle) filters intermediate library frames so wrapper libraries are skipped and client code is reported.findCallsite(String, StackTraceElement[])for unit-testable stack simulation.Telemetry - RFC stateful heartbeat
ScaReachabilityDependencyRegistry(internal-api): stateful registry for CVE state perartifact@version.registerCve()creates an entry withreached:[];recordHit()stores the first callsite viaAtomicReference.compareAndSet(first-hit-wins).ScaReachabilityPeriodicAction(telemetry): on each heartbeat, drains bothDependencyService(new JARs) and the registry (CVE state changes), merges into one entry per dependency. ReplacesDependencyPeriodicActionwhen SCA is enabled to avoid duplicateapp-dependencies-loadedentries.Dependency: new nullablereachabilityMetadatafield;TelemetryRequestBody.writeDependency()writes themetadataarray when present.Benchmarks
The local
benchmark/folder has been removed from dd-trace-java (see #11502). Thescavariant (-Ddd.appsec.enabled=true -Ddd.appsec.sca.enabled=true) was contributed to the newapm-sdks-benchmarksimplementation by @sarahchen6 in DataDog/apm-sdks-benchmarks#161, based on the variant I added here originally.Motivation
Implements APPSEC-62260 - SCA Reachability runtime detection per the RFC.
Additional Notes
The
generateScaCvesJsonGradle plugin and the committedsca_cves.jsonare a temporary solution. In a future iteration the SCA symbol database will be delivered at runtime via Remote Config, at which point this plugin and the committed JSON file will be removed.sca_cves.jsonis committed to the repo becausesca-reachability-databaseis a private repo not accessible from CI without a token. Only the maintainer runs the refresh task.Method-level symbols in
sca_cves.json(snakeyamlload/loadAll, xstreamfromXML, etc.) are manually curated for testing. The database format does not yet define method-level entries; when it does, the Gradle task will be updated and the manual entries removed.AbstractStackWalker.isNotDatadogTraceStackElementvisibility changed from package-private topublicto allow callsite filtering from theappsecmodule.sca_stack_exclusion.trieis a curated copy of the IAST exclusion trie adapted for SCA callsite detection (testing framework entries removed). The trie cannot be reused directly from IAST without a circular dependency.The
ScaReachabilitySmokeTestextendsAbstractAppSecServerSmokeTestand remains Groovy until that base class is migrated to Java.Contributor Checklist
type:and (comp:orinst:) labels in addition to any other useful labelsclose,fix, or any linking keywords when referencing an issueUse
solvesinstead, and assign the PR milestone to the issueJira ticket: APPSEC-62260
Note: Once your PR is ready to merge, add it to the merge queue by commenting
/merge./merge -ccancels the queue request./merge -f --reason "reason"skips all merge queue checks; please use this judiciously, as some checks do not run at the PR-level. For more information, see this doc.