rfc(feature): SDK-symbolicated stack-frames#152
rfc(feature): SDK-symbolicated stack-frames#152supervacuus wants to merge 6 commits intogetsentry:mainfrom
Conversation
7abc153 to
00c6d10
Compare
There was a problem hiding this comment.
Overall makes sense.
We definitely shouldn't re-Use symbolicator_status, this field is not part of the Event ingestion, it's purely a symbolicator/sentry field (like you call out).
Inferring from other data seems not too bad, but an explicit signal is probably preferred and much less confusing.
There is also an option of re-using the platform for this, since that already controls symbolication. Either by extending it, replacing it or setting a magic noop (not really a fan of this).
Clarify language and punctuation in the RFC for SDK symbolication of stack frames, improving readability and consistency.
|
Thanks for taking the time, @Dav1dde!
Could I ask you to elaborate on what this would look like in practice? Do you mean in the sense that the Or do you mean the |
|
So as an example, I think if you were to set the Now obviously this isn't the greatest idea, but maybe there is a world where we go away from Examples:
We currently infer the Symbolication decision from the platform, but we could also make it explicit. This is a pretty unfiltered thought and definitely needs a lot more thought and bikeshedding. I think I'd prefer the flag over this, but possibly worth a discussion. |
Yes, this is a relevant point I didn't explicitly highlight (edit: added): we also tie the rendering of each frame to the That is also why I thought RFCing this makes sense: there are good reasons to do it entirely server-side as well as SDK-side. The fact that the SDK-set
Agree and matches the format. Right now, I am looking for clarity in the proposal (and counterproposals, thx) and for issues I haven't thought of (or incorrectly framed). So, this is welcome feedback at that level of filtering. |
|
|
||
| - **relay-event-schema**: Make `symbolicator_status` a recognized field that SDKs may set, and ensure it is not stripped during ingestion. | ||
| - **processing.py**: Check if `symbolicator_status` is already set before overwriting it with symbolicator's result. | ||
| - **SDKs**: Set `data.symbolicator_status` on symbolicated frames. |
There was a problem hiding this comment.
I see an Option D where The SDK and symbolicator use a common format to annotate the symbolication status. This would be future-proof towards what software component does the symbolication. Something like
"symbolication": {
"by": "symbolicator", // or "client"
"status": "success"
}Later we could even extend the protocol to log multiple unsuccessful attempts, if needed:
"symbolication": {
"by": "symbolicator",
"status": "success",
"previous_attempts": [
{"by": "client", "status": "failed", "reason": "unknown_image"}
]
}There was a problem hiding this comment.
I can see how this could be useful from a diagnostic/audit perspective, and I also agree that it would be more in line with what symbolicator_status currently does: it records the outcome of symbolicator's handling of the frame.
However, I think this is almost orthogonal to my intention for this RFC (though it could still be in scope if we decide to implement it), and I might have communicated it poorly by introducing symbolicator_status as a potential solution to the problem (because it acts more as a counter pattern).
These are the core questions for which I seek a design proposal from this RFC:
- Can we let the SDK decide whether processing/symbolicator should handle a particular stack frame or not?
- How far do we want to parameterize this (is this an on/off switch per frame, or do we want to open the encoding vs
platformas described here: rfc(feature): SDK-symbolicated stack-frames #152 (comment) and what about symbolicator responsibilities that go beyond symbols, like source-code and line numbers, etc.)? - Should SDKs have that power or not? While the two questions above are very much "does the shoe fit now?", this one is about a trajectory: do we already have capabilities lined up that would be inhibited by pushing this towards the SDK?
So, purely from an intent perspective, I see your proposal as a record after the fact, whereas I am currently looking for a mechanism to control that outcome. Maybe I misunderstood. Also, if anything in this comment wasn't clear from the RFC, let me know whether and where I should extend it. Thanks!
There was a problem hiding this comment.
Or, would you argue that this is also intended as a mechanism for control, because native processing could implement not to send a frame to symbolicator (or overwrite the client symbol if a symbolicaton request is attempted anyway) if a stack frame contains:
"symbolication": {
"by": "client",
"status": "success"
}Then it could be both diagnostic and control. But we should decide on the control meaning getting such a symbolication tag from the SDK.
There was a problem hiding this comment.
In any case, I integrated the feedback I got until now: bbd301c
As you can see, I just massaged your "Option D" into a control mechanism. Let me know if that aligns with what you had in mind.
There was a problem hiding this comment.
Or, would you argue that this is also intended as a mechanism for control
Yes, I would intend it mainly as a control mechanism which can also serve as a diagnostic tool or to give the user more visibility of what happened to their stack traces. Thank you for adding the option!
jjbayer
left a comment
There was a problem hiding this comment.
FWIW I think Option A would work perfectly fine. Option D is just an attempt to foresee potential future use-cases / unify the newly proposed protocol with the existing one.
| - **relay-event-schema**: Add a typed `symbolication` object to `Frame` with fields `by: enum("client", "symbolicator")`, `status: enum("success", "failed", "missing", "unknown")`, and optionally `reason: string` and `previous_attempts: list`. | ||
| - **SDKs**: Set `symbolication: { "by": "client", "status": "success" }` on frames the SDK has symbolicated. | ||
| - **processing.py**: Check for `symbolication.by == "client" && symbolication.status == "success"` early in frame handling; if present, skip symbolicator. Otherwise, proceed normally and write the symbolicator result into the same `symbolication` field. | ||
| - **symbolicator**: Write results into the `symbolication` object instead of (or in addition to) the current `symbolicator_status` field in `frame.data`. |
There was a problem hiding this comment.
Symbolicator could double-write into both fields for a while, until the sentry backend is updated to use the new field for a while.
|
|
||
| - **relay-event-schema**: Make `symbolicator_status` a recognized field that SDKs may set, and ensure it is not stripped during ingestion. | ||
| - **processing.py**: Check if `symbolicator_status` is already set before overwriting it with symbolicator's result. | ||
| - **SDKs**: Set `data.symbolicator_status` on symbolicated frames. |
There was a problem hiding this comment.
Or, would you argue that this is also intended as a mechanism for control
Yes, I would intend it mainly as a control mechanism which can also serve as a diagnostic tool or to give the user more visibility of what happened to their stack traces. Thank you for adding the option!
|
For JS this this would not have implications I think as there is no client-side symbolication apart from nextjs dev server. We can adapt out stackframe type definition if we land on a type here but would likely leave that undefined in all cases that we currently cover. 👍 |
| - How are the associated debug-meta images being resolved? If a frame is marked as "client-symbolicated" and thus doesn't show a "missing symbol" warning, images that are missing, but whose associated frames are entirely "client-symbolicated", shouldn't be marked as an error either (but should still be listed). Should this be resolved on the client (with a tag similar to the frame)? Or should this be resolved in native processing? | ||
| - Are there workflow interactions between symbolication, line-number, and source lookup that would be prevented by the presented approach? | ||
| - What is the interaction with demangling? If an SDK provides a mangled C++ symbol, should the backend still demangle it even if the frame is marked as client-symbolicated? | ||
| - Should the UI distinguish between server-symbolicated and client-symbolicated frames (e.g., a subtle indicator), or treat them identically? |
There was a problem hiding this comment.
I believe surfacing the origin of the symbolication could be beneficial for debugging absent symbols - which happens easily in case CI workflows uploading debug symbols soft-fail to do so. By comparing symbolicated issues to issues without symbols with indication of the source could indicate if it's an SDK issue or not.
| This problem is not theoretical. It already manifests today in at least two concrete scenarios: | ||
|
|
||
| - **Tombstone / native crash reporting**: Native SDKs that symbolicate system library frames on-device before sending the event. The backend cannot distinguish these from unsymbolicated frames and flags them as broken since it cannot find any symbol information in its stores. On Android in particular, we usually have both situations: native user libraries will be packaged stripped and `Symbolicator` must resolve associated frames (i.e., the UI warning is sensible if symbol/debug-info is missing), but system and framework libraries usually will be symbolicated on-device and `Symbolicator` won't have access to data to further enrich the stack frame. In that case, the UI error is misleading and inactionable to the user. | ||
| - **.NET SDK**: The .NET SDK resolves function names, file paths, and line numbers locally using portable PDB metadata. When these frames arrive at the backend, symbolicator still attempts to process them. Because the user hasn't uploaded PDB files to Sentry (the SDK already handled this), every frame returns `symbolicator_status: "missing"` and the UI shows a misleading symbolication error banner. This is tracked in [getsentry/sentry#97054](https://github.com/getsentry/sentry/issues/97054). |
There was a problem hiding this comment.
note: applies to JIT-compiled apps ... for Native AOT compiled apps, we only got function names, but I believe file names and line numbers are missing ... but Native AOT maybe a whole different story in this context
There was a problem hiding this comment.
or Native AOT compiled apps, we only got function names, but I believe file names and line numbers are missing
Okay, but is the assumption that those filenames and line numbers can be resolved later because the user will upload them as debug info?
but Native AOT maybe a whole different story in this context
How so?
There was a problem hiding this comment.
I think it's pretty unlikely we'd be symbolicating AOT compiled code on the client (the subject of this RFC)... but we should be able to resolve symbols later.
AFAIK during AOT compilation, when converting the IL into assembly, the linker can create a map from the native instructions back to the original IL... and this map gets stored in the symbol file. So even though the app is AOT compiled and you no longer have the original IL, you can still symbolicate if you have the maps in the symbol files.
Testing this out by running dotnet publish -c Release -f net10.0 samples/Sentry.Samples.Console.Basic/Sentry.Samples.Console.Basic.csproj to get an AOT complied app I see that we sometimes get line numbers and sometimes don't.
For this exception I see full symbolication of the system frames but not for the InApp frames:
System.InvalidOperationException: Something happened that crashed the app!
at Program.<>c__DisplayClass0_0.<<<Main>$>g__ThirdFunction|4>d.MoveNext()()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()() in ExceptionDispatchInfo.cs:line 53
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task)() in TaskAwaiter.cs:line 154
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task, ConfigureAwaitOptions)() in TaskAwaiter.cs:line 118
at Program.<<Main>$>d__0.MoveNext()()
at System.Runtime.ExceptionServices.ExceptionDispatchInfo.Throw()() in ExceptionDispatchInfo.cs:line 53
at System.Runtime.CompilerServices.TaskAwaiter.ThrowForNonSuccess(Task)() in TaskAwaiter.cs:line 154
at System.Runtime.CompilerServices.TaskAwaiter.HandleNonSuccessAndDebuggerNotification(Task, ConfigureAwaitOptions)() in TaskAwaiter.cs:line 118
at Program.<Main>(String[] args)()
at ?()
That will probably be an issue with not resolving the symbols I uploaded for that app... none the less, even the symbolication of the system frames demonstrates that we can symbolicate (file names and line numbers) AOT compiled code.
There was a problem hiding this comment.
Thanks for the input from both of you. Just to clarify: this proposal is only relevant for scenarios where full symbol coverage via a backend store is not an option (the classic example being Android, because there are more platform OEMs than we can realistically collect from). So, if you know that the uploaded debug info can fully recover every frame, there is no need for this feature.
The .NET-related issue I linked in the RFC just highlights a problem that aligns with what we are trying to solve here:
- The UI signals to the user that something is missing, and while that is technically true, it is not necessarily actionable.
- In that case, the client was able to symbolicate frames from
ConsoleOnboarding.dll, but the signal to the user is still that something is not right and needs to be fixed. - With a tag from the SDK indicating that symbolication happened on the client for a particular frame, the UI could check that all frames associated with that module have already been correctly symbolicated and, as such, lower the severity of the warning and call to action.
- Of course, in this particular case, there might still be a realistic chance that this is an okay warning (but the issue was created in response, so I guess even here the error is overstating the fact and confusing to users)
- In scenarios where the referenced module will never be uploaded, the warning is just factually wrong (and in the case of native, it would not only affect the module being marked as missing, but each associated frame would be marked, and the event would also get an error at the top, as shown in the screenshots)
| - **Symbolicator**: May need awareness of the flag if processing delegates the decision. | ||
| - **UI**: Should stop showing misleading "missing symbols" errors for frames that are intentionally client-symbolicated. | ||
|
|
||
| # Options Considered |
There was a problem hiding this comment.
question: Do I understand it correctly that when both sides are given (uploaded debug files & SDK-runtime-resolved metadata), the Client-provided metadata is preferred over the previously uploaded symbols?
There was a problem hiding this comment.
As part of this proposal's options, or how it is currently implemented?
This is a feature RFC to find a solution for SDK/client-side symbolicated stack frames:
Rendered RFC