-
Notifications
You must be signed in to change notification settings - Fork 739
Fix JSON paths #30243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Fix JSON paths #30243
Conversation
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
🟢 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
This PR fixes JSON path handling by implementing a proper JSONPath parser and accessor system, replacing the previous workaround that manually stripped $. prefixes from paths. The changes introduce a new TJsonPathAccessor and TJsonPathAccessorTrie infrastructure for correctly parsing and navigating JSON paths according to RFC 9535 standards.
Key Changes:
- Introduced new JSON path processing infrastructure using YQL's jsonpath parser instead of manual string manipulation
- Removed the workaround in query optimization that stripped quotes from simple JSON paths
- Updated JSON subcolumn extraction to properly handle quoted keys and special characters in member names
Reviewed changes
Copilot reviewed 27 out of 27 changed files in this pull request and generated 14 comments.
Show a summary per file
| File | Description |
|---|---|
| json_value_path.h/cpp | New core JSON path parsing and accessor infrastructure |
| kqp_opt_phy_olap_filter.cpp | Removed manual JSON path quote-stripping workaround |
| kernel_logic.h/cpp | Updated to use new JSON path extraction with proper error handling |
| json_ut.cpp | Added comprehensive tests for special characters and path parsing |
| ut_sub_columns.cpp | Added unit tests for JSON path trie and accessor functionality |
| stats.h | Updated key lookup to use JSON path trie instead of string comparison |
| partial.h/cpp | Changed to return TJsonPathAccessor with remaining path support |
| others_storage.h/cpp | Updated accessor methods to use new JSON path infrastructure |
| columns_storage.h | Modified GetPathAccessor to use trie-based lookup |
| accessor.h/cpp | Refactored TJsonRestorer to use proper JSON path parsing |
| direct_builder.cpp | Updated key string building to use QuoteJsonItem for proper escaping |
| data_extractor.cpp | Restricted to only support top-level JSON objects |
| json_extractors.cpp | Added handling for empty objects in JSON extraction |
| written.cpp | Added debug logging (appears to be temporary) |
| composite_serial/accessor.cpp | Fixed incorrect TConclusion boolean check |
| binary_json_value_view.h/cpp | Exposed JsonNumberToString as public static method |
| ya.make | Added jsonpath dependency |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| // TODO: IDK why it is needed, re-check | ||
| // return std::make_shared<TSparsedArray>(nullptr, arrow::binary(), recordsCount); |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TODO comment with uncertainty and lacking tracking information. The comment "IDK why it is needed, re-check" suggests the author is unsure about the commented-out code's purpose. Either remove the commented code if it's not needed, or properly document why it might be needed in the future.
| // TODO: IDK why it is needed, re-check | |
| // return std::make_shared<TSparsedArray>(nullptr, arrow::binary(), recordsCount); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Перепроверить
ydb/core/formats/arrow/accessor/sub_columns/json_value_path.cpp
Outdated
Show resolved
Hide resolved
| TConclusion<std::shared_ptr<TJsonPathAccessor>> GetPathAccessor(const std::string_view path) const { | ||
| auto jsonPathAccessorTrie = std::make_shared<NKikimr::NArrow::NAccessor::NSubColumns::TJsonPathAccessorTrie>(); | ||
| for (ui32 i = 0; i < Stats.GetColumnsCount(); ++i) { | ||
| auto insertResult = jsonPathAccessorTrie->Insert(ToSubcolumnName(Stats.GetColumnName(i)), Records->GetColumnVerified(i)); | ||
| AFL_VERIFY(insertResult.IsSuccess())("error", insertResult.GetErrorMessage()); | ||
| } | ||
| return jsonPathAccessorTrie->GetAccessor(path); |
Copilot
AI
Dec 5, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similar performance issue: A new TJsonPathAccessorTrie is created and populated on every call to GetPathAccessor. This is inefficient for repeated lookups. Consider caching the trie as a member variable or using a more efficient lookup structure.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@dorooleg как думаешь что тут лучше? Мне кажется тут сокрее будут одноразовые акции по вызову GetPathAccessor, поэтому, чтобы лишний раз не строить на каждое создание объекта это дерево, сделал это здесь.
|
⚪
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
|
⚪ ⚪ Ya make output | Test bloat | Test bloat
🟢
*please be aware that the difference is based on comparing your commit and the last completed build from the post-commit, check comparation |
ydb/core/tx/columnshard/engines/reader/common_reader/iterator/default_fetching.h
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 26 out of 26 changed files in this pull request and generated 5 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pull request overview
Copilot reviewed 25 out of 25 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| TConclusion<std::shared_ptr<NSubColumns::TJsonPathAccessor>> TSubColumnsPartialArray::GetPathAccessor(const std::string_view svPath, const ui32 recordsCount) const { | ||
| auto jsonPathAccessorTrie = std::make_shared<NKikimr::NArrow::NAccessor::NSubColumns::TJsonPathAccessorTrie>(); | ||
| auto headerStats = Header.GetColumnStats(); | ||
| for (ui32 i = 0; i < headerStats.GetDataNamesCount(); ++i) { | ||
| if (PartialColumnsData.HasColumn(i)) { | ||
| auto insertResult = jsonPathAccessorTrie->Insert(NSubColumns::ToJsonPath(headerStats.GetColumnName(i)), PartialColumnsData.GetAccessorVerified(i)); | ||
| AFL_VERIFY(insertResult.IsSuccess())("error", insertResult.GetErrorMessage()); | ||
| } | ||
| } | ||
|
|
||
| auto accessorResult = jsonPathAccessorTrie->GetAccessor(svPath); | ||
| if (accessorResult.IsSuccess() && accessorResult.GetResult()->IsValid()) { | ||
| return accessorResult; | ||
| } | ||
|
|
||
| if (OthersData) { | ||
| return OthersData->GetPathAccessor(svPath, recordsCount); | ||
| } else { | ||
| AFL_VERIFY(!Header.GetOtherStats().GetKeyIndexOptional(svPath)); | ||
| return std::make_shared<TTrivialArray>(TThreadSimpleArraysCache::GetNull(arrow::binary(), recordsCount)); | ||
| } | ||
|
|
||
| return std::shared_ptr<NSubColumns::TJsonPathAccessor>{}; |
Copilot
AI
Dec 8, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A new TJsonPathAccessorTrie is created on every call to GetPathAccessor. Consider caching the trie to avoid rebuilding it for each call, which could improve performance significantly when this method is called frequently.
Changelog entry
...
Changelog category
Description for reviewers
...