Add SQL parser foundation: core types, arena, tokenizer, classifier#4
renecannao merged 44 commits into main from
Conversation
Hand-written recursive descent parser with two-tier architecture: deep parse for SELECT/SET, classify+extract for everything else. Targets sub-microsecond latency with arena allocation and zero-copy.
…tement support
- Define arena block-chaining strategy (never realloc, overflow blocks freed on reset)
- Fix AstNode field ordering to achieve 32 bytes with static_assert
- Add multi-statement query handling via ParseResult::remaining
- Clarify PARTIAL semantics for both tiers
- Fix BoundValue: add DATETIME/DECIMAL types, separate float32/float64
- Replace prepare_cache_store with parse_and_cache for safe API
- Add threading note for session migration across threads
- Correct RETURNING clause attribution (DML, not SELECT)
- Mark cross-dialect emission as out of scope
- Mandate StringRef as trivially copyable with static_assert
- Add ErrorInfo lifetime documentation
- Add max query length / arena size documentation
- Fix QualifiedIdentifier test to use non-keyword identifiers
- Fix scan_to_end to use correct remaining calculation from the start
- Extract ci_cmp to common.h, remove duplicates from keyword headers
- Add NAMES keyword to PostgreSQL keyword table
- Add missing tests for REPLACE, GRANT, REVOKE, LOCK, DEALLOCATE
- Add input_end() to Tokenizer from the start
…aping
- Mark unused `first` parameter in 12 Tier 2 extractors as /* first */
- Handle '' (doubled single-quote) escape in scan_single_quoted_string()
- Fix unused variable warning in AllocateAligned test
- Build now produces zero warnings with -Wall -Wextra
Pull request overview
Adds the initial implementation of a dialect-templated SQL tokenizer + classifier/extractor parser (with arena allocation and core types), along with a new unit-test harness. The PR also vendors a full GoogleTest/GoogleMock distribution under third_party/ to support those tests.
Changes:
- Introduces core parser components:
StringRef, Arena, Tokenizer<Dialect>, Parser<Dialect>, ParseResult, keyword tables, and a compact AstNode.
- Adds unit tests for arena, tokenizer, and classifier behavior (MySQL + PostgreSQL).
- Vendors GoogleTest/GoogleMock + associated docs/CI/build files into
third_party/googletest/.
Reviewed changes
Copilot reviewed 65 out of 264 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| third_party/googletest/googlemock/include/gmock/gmock-cardinalities.h | Vendored GoogleMock header (test dependency). |
| third_party/googletest/googlemock/docs/README.md | Vendored GoogleMock docs pointer. |
| third_party/googletest/googlemock/cmake/gmock_main.pc.in | Vendored pkg-config template for gmock_main. |
| third_party/googletest/googlemock/cmake/gmock.pc.in | Vendored pkg-config template for gmock. |
| third_party/googletest/googlemock/README.md | Vendored GoogleMock README. |
| third_party/googletest/googlemock/CMakeLists.txt | Vendored GoogleMock CMake build. |
| third_party/googletest/docs/samples.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/reference/mocking.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/reference/matchers.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/reference/actions.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/quickstart-cmake.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/quickstart-bazel.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/primer.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/platforms.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/pkgconfig.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/index.md | Vendored GoogleTest docs index. |
| third_party/googletest/docs/gmock_faq.md | Vendored GoogleMock FAQ. |
| third_party/googletest/docs/gmock_cheat_sheet.md | Vendored GoogleMock cheat sheet. |
| third_party/googletest/docs/community_created_documentation.md | Vendored GoogleTest community docs listing. |
| third_party/googletest/docs/assets/css/style.scss | Vendored docs site styling. |
| third_party/googletest/docs/_sass/main.scss | Vendored docs site styling. |
| third_party/googletest/docs/_layouts/default.html | Vendored docs site layout. |
| third_party/googletest/docs/_data/navigation.yml | Vendored docs site navigation. |
| third_party/googletest/docs/_config.yml | Vendored docs site config. |
| third_party/googletest/ci/windows-presubmit.bat | Vendored upstream CI script. |
| third_party/googletest/ci/macos-presubmit.sh | Vendored upstream CI script. |
| third_party/googletest/ci/linux-presubmit.sh | Vendored upstream CI script. |
| third_party/googletest/WORKSPACE | Vendored Bazel workspace config. |
| third_party/googletest/README.md | Vendored GoogleTest README. |
| third_party/googletest/LICENSE | Vendored GoogleTest license. |
| third_party/googletest/CONTRIBUTORS | Vendored contributor list. |
| third_party/googletest/CONTRIBUTING.md | Vendored contributing guidelines. |
| third_party/googletest/CMakeLists.txt | Vendored top-level CMake entry. |
| third_party/googletest/BUILD.bazel | Vendored Bazel build file. |
| third_party/googletest/.gitignore | Vendored ignore patterns. |
| third_party/googletest/.github/workflows/gtest-ci.yml | Vendored GitHub Actions config. |
| third_party/googletest/.github/ISSUE_TEMPLATE/config.yml | Vendored issue template config. |
| third_party/googletest/.github/ISSUE_TEMPLATE/10-feature_request.yml | Vendored issue template. |
| third_party/googletest/.github/ISSUE_TEMPLATE/00-bug_report.yml | Vendored issue template. |
| third_party/googletest/.clang-format | Vendored clang-format config. |
| tests/test_tokenizer.cpp | Adds tokenizer unit tests for both dialects. |
| tests/test_main.cpp | Adds a gtest main() for the suite. |
| tests/test_classifier.cpp | Adds classifier/extractor tests for both dialects. |
| tests/test_arena.cpp | Adds arena allocator unit tests. |
| src/sql_parser/parser.cpp | Implements dialect-templated parser classification + tier-2 extractors. |
| src/sql_parser/arena.cpp | Implements arena allocator (block-chained, reset frees overflow). |
| include/sql_parser/tokenizer.h | Implements dialect-templated pull tokenizer with keyword lookup. |
| include/sql_parser/token.h | Defines TokenType and Token. |
| include/sql_parser/parser.h | Declares Parser<D> public API and extractor methods. |
| include/sql_parser/parse_result.h | Defines ParseResult and extracted metadata fields. |
| include/sql_parser/keywords_pgsql.h | PostgreSQL keyword table + binary-search lookup. |
| include/sql_parser/keywords_mysql.h | MySQL keyword table + binary-search lookup. |
| include/sql_parser/common.h | Defines Dialect, StringRef, StmtType, NodeType, ci_cmp. |
| include/sql_parser/ast.h | Defines compact AstNode and arena make_node. |
| include/sql_parser/arena.h | Declares arena allocator API. |
| docs/superpowers/specs/2026-03-24-sql-parser-design.md | Adds design spec for the new parser architecture. |
| Makefile.new | Adds build/test wiring for the new library and gtest-based unit tests. |
```cpp
Arena::Arena(size_t block_size, size_t max_size)
    : block_size_(block_size), max_size_(max_size), total_allocated_(0) {
  primary_ = allocate_block(block_size_);
  current_ = primary_;
  total_allocated_ = block_size_;
}
```
If allocate_block(block_size_) fails, primary_/current_ become null, but allocate() later dereferences current_ unconditionally. Please handle initial allocation failure (e.g., set primary_/current_ to nullptr, keep total_allocated_ at 0, and have allocate()/reset() gracefully return nullptr/no-op when current_/primary_ is null). Also consider validating max_size_ >= block_size_ (or clamping/adjusting) to keep accounting consistent.
```cpp
// -- line comment (MySQL requires space after --, PgSQL doesn't but we handle both)
if (c == '-' && peek_char(1) == '-') {
    cursor_ += 2;
    while (cursor_ < end_ && *cursor_ != '\n') ++cursor_;
    continue;
```
This treats -- as a line comment unconditionally for all dialects. In MySQL, -- starts a comment only when followed by whitespace/control (or end of input). As written, valid expressions like SELECT 1--1 will be tokenized incorrectly (the --1 becomes a comment). Please make the -- recognition dialect-correct: always comment for PostgreSQL, but for MySQL require a valid comment terminator after the second - (space/tab/newline/\r/\0, etc.). Add a unit test covering 1--1 in MySQL to prevent regressions.
Suggested change:

```cpp
// -- line comment
if (c == '-' && peek_char(1) == '-') {
    if constexpr (D == Dialect::PostgreSQL) {
        // PostgreSQL: -- always starts a line comment
        cursor_ += 2;
        while (cursor_ < end_ && *cursor_ != '\n') ++cursor_;
        continue;
    } else {
        // MySQL: -- starts a comment only if followed by whitespace/control or end-of-input
        char after2 = peek_char(2);
        if (after2 == ' ' || after2 == '\t' || after2 == '\n' ||
            after2 == '\r' || after2 == '\0') {
            cursor_ += 2;
            while (cursor_ < end_ && *cursor_ != '\n') ++cursor_;
            continue;
        }
    }
```
```cpp
// MySQL: backtick-quoted identifier
Token scan_backtick_identifier() {
    ++cursor_;  // skip opening backtick
    const char* content_start = cursor_;
    while (cursor_ < end_ && *cursor_ != '`') ++cursor_;
    uint32_t len = static_cast<uint32_t>(cursor_ - content_start);
    if (cursor_ < end_) ++cursor_;  // skip closing backtick
    return make_token(TokenType::TK_IDENTIFIER, content_start, len);
}

// PostgreSQL: double-quoted identifier
Token scan_double_quoted_identifier() {
    ++cursor_;  // skip opening quote
    const char* content_start = cursor_;
    while (cursor_ < end_ && *cursor_ != '"') ++cursor_;
    uint32_t len = static_cast<uint32_t>(cursor_ - content_start);
    if (cursor_ < end_) ++cursor_;  // skip closing quote
    return make_token(TokenType::TK_IDENTIFIER, content_start, len);
}
```
Quoted identifiers can contain escaped quote characters (MySQL: a doubled backtick inside a backtick-quoted identifier; PostgreSQL: a doubled quote "" inside a double-quoted identifier). These scanners stop at the first closing quote/backtick, so inputs like ``a``b`` (backtick-quoted) or "a""b" will be tokenized incorrectly and likely break downstream parsing. Please update both scanners to correctly consume doubled-quote escapes; if you want to keep StringRef zero-copy, at minimum you should consume the full token without prematurely terminating (even if unescaping is deferred/materialized later).
```cpp
Token Parser<D>::read_table_name(StringRef& schema_out) {
    Token name = tokenizer_.next_token();
    if (name.type != TokenType::TK_IDENTIFIER &&
        name.type != TokenType::TK_EOF) {
        // Keywords used as table names (e.g., CREATE TABLE `user`)
        // The tokenizer returns keyword tokens for reserved words.
        // Accept any non-punctuation token as a potential name.
    }

    // Check for qualified name: schema.table
    if (tokenizer_.peek().type == TokenType::TK_DOT) {
        schema_out = name.text;
        tokenizer_.skip();  // consume dot
        Token table = tokenizer_.next_token();
        return table;
    }

    schema_out = StringRef{};
    return name;
}
```
The comment says 'Accept any non-punctuation token as a potential name', but the code currently accepts any token type (including punctuation/operators) with no filtering. This can produce clearly incorrect metadata—for example, INSERT INTO (SELECT ...) would treat ( as the table token and set table_name to a 1-byte punctuation token. Please implement an actual validation step here (e.g., allow TK_IDENTIFIER and a restricted set of keyword tokens usable as identifiers, but reject punctuation/operators/literals and return an ERROR/PARTIAL result or an empty table_name).
```makefile
CXX = g++
CXXFLAGS = -std=c++17 -Wall -Wextra -g -O2
CPPFLAGS = -I./include -I./third_party/googletest/googletest/include
```
The PR description focuses on the new SQL parser foundation, but the diff also vendors a full GoogleTest/GoogleMock distribution and adds new build plumbing to compile it from source. Please reflect this explicitly in the PR description (or consider splitting the GoogleTest vendoring into a separate PR) so reviewers can assess third-party import/licensing/update implications independently from the parser changes.
Addresses review findings:
- Handle NOT IN / NOT BETWEEN / NOT LIKE as compound infix operators
- Add CASE/WHEN expression parsing
- Support := assignment in SET parser
- Move implementation includes to parser.cpp (not parser.h)
- Add tests for all above + zero-arg functions
Plan 2 Complete: Expression Parser + SET Deep Parser
Added
Test Results
Runs on Ubuntu 22.04/24.04 with g++ and clang++, plus macOS. Builds libsqlparser.a and runs all tests via Makefile.new.
Fixes from review:
- is_alias_start() default returns true (not false)
- TK_UNION/TK_OF/EXISTS added before select_parser.h compiles
- EXISTS expression handling added to expression parser
…BY, LIMIT, locking
Plan 3 Complete: SELECT Deep Parser
Added
Test Results
Normalizations documented:
- CHARSET -> CHARACTER SET in output
- TO -> = for PostgreSQL SET assignments
- ISOLATION LEVEL re-inserted by emitter from isolation level value
Plan 4 Complete: Query Emitter
Added
Test Results
Current Progress
Plan 5: Prepared statement cache with LRU, deep-copy AST, bindings-aware emitter
Plan 6: Google Benchmark performance tests for all parser operations
… bindings-aware emitter
Plans 5 & 6 Complete: Prepared Statement Cache + Benchmarks
Plan 5: Prepared Statement Cache
Plan 6: Performance Benchmarks
All operations benchmarked with Google Benchmark (-O2, debug build):
*Tier 2 DML classification slightly over target in debug build; release build (

Final State
…and query digest
Covers full MySQL + PostgreSQL syntax for INSERT, UPDATE, DELETE deep parsers, compound queries with INTERSECT precedence, and AST-based query digest with token-level fallback for Tier 2 statements.
…g, UPDATE AST
- Extract TableRefParser from SelectParser as prerequisite refactoring
- CompoundQueryParser as separate layer above SelectParser
- UPDATE AST uses single NODE_FROM_CLAUSE with positional disambiguation
- Document classifier switch updates and is_alias_start blocklist changes
- Enumerate digest-mode emit method behavior changes
- DEFAULT VALUES uses two-token approach (TK_DEFAULT + TK_VALUES)
Plan 7: TableRefParser extraction + INSERT/REPLACE deep parser (closes #5)
Plan 8: UPDATE deep parser with multi-table JOIN + FROM (closes #6)
Plan 9: DELETE deep parser with multi-table forms (closes #7)
Plan 10: CompoundQueryParser with UNION/INTERSECT/EXCEPT (closes #8)
Plan 11: Query digest with AST + token-level modes (closes #9)
Moves parse_from_clause(), parse_table_reference(), parse_join(), parse_optional_alias(), is_join_start(), and is_alias_start() into a standalone TableRefParser<D> utility class. SelectParser delegates to TableRefParser internally. All 236 existing tests pass unchanged.
New tokens: TK_DELAYED, TK_HIGH_PRIORITY, TK_DUPLICATE, TK_KEY, TK_CONFLICT, TK_DO, TK_NOTHING, TK_RETURNING, TK_CONSTRAINT. New node types: NODE_INSERT_STMT, NODE_INSERT_COLUMNS, NODE_VALUES_CLAUSE, NODE_VALUES_ROW, NODE_INSERT_SET_CLAUSE, NODE_ON_DUPLICATE_KEY, NODE_ON_CONFLICT, NODE_CONFLICT_TARGET, NODE_CONFLICT_ACTION, NODE_RETURNING_CLAUSE, NODE_UPDATE_SET_ITEM, NODE_STMT_OPTIONS. Updated is_alias_start() blocklist and keyword-as-identifier list.
InsertParser<D> handles full MySQL and PostgreSQL INSERT/REPLACE syntax:
- MySQL: INSERT/REPLACE [LOW_PRIORITY|DELAYED|HIGH_PRIORITY] [IGNORE]
[INTO] table [(cols)] {VALUES|SELECT|SET} [ON DUPLICATE KEY UPDATE]
- PostgreSQL: INSERT INTO table [(cols)] {VALUES|SELECT|DEFAULT VALUES}
[ON CONFLICT ...] [RETURNING ...]
Emitter extended with all INSERT-related node types. Classifier updated
to route TK_INSERT and TK_REPLACE to parse_insert(). Backward
compatibility maintained for table_name/schema_name extraction.
32 new tests covering MySQL and PostgreSQL INSERT/REPLACE:
- Basic VALUES, multi-row, column list, INSERT SELECT, INSERT SET
- ON DUPLICATE KEY UPDATE, IGNORE, LOW_PRIORITY, DELAYED, HIGH_PRIORITY
- REPLACE with various options
- PostgreSQL ON CONFLICT DO NOTHING/UPDATE, RETURNING, DEFAULT VALUES
- Bulk data-driven tests (24 MySQL + 16 PostgreSQL variants)
- Round-trip emitter tests (7 MySQL + 5 PostgreSQL)

Also fixes is_alias_start() to block TK_LPAREN, which was incorrectly treated as a valid implicit alias token. Total test count: 268 (was 236, +32 new)
New node types for the UPDATE deep parser. NODE_UPDATE_SET_ITEM and NODE_STMT_OPTIONS already exist from Plan 7.
Supports MySQL (LOW_PRIORITY, IGNORE, multi-table JOINs, ORDER BY, LIMIT) and PostgreSQL (ONLY, alias, FROM clause, RETURNING). Uses ExpressionParser for expressions and TableRefParser for table refs.
Add emit_update_stmt and emit_update_set_clause to emitter. Replace extract_update() Tier 2 extractor with parse_update() Tier 1 parser in classifier dispatch. Handles MySQL multi-table emission without FROM keyword and PostgreSQL FROM clause with keyword.
23 tests covering MySQL (simple, multi-column, options, ORDER BY, LIMIT, multi-table JOIN, comma join, LEFT JOIN) and PostgreSQL (simple, FROM, multi-table FROM, RETURNING, alias). Includes bulk data-driven tests (17 MySQL + 9 PostgreSQL variants) and round-trip emitter tests for both dialects.
Plan 9 Task 1: Add DELETE-specific node types to the NodeType enum for the upcoming DELETE deep parser.
Plan 9 Task 2: Implement DeleteParser<D> header-only template supporting:
- MySQL single-table: DELETE [LOW_PRIORITY] [QUICK] [IGNORE] FROM table [WHERE] [ORDER BY] [LIMIT]
- MySQL multi-table form 1: DELETE t1, t2 FROM table_refs [WHERE]
- MySQL multi-table form 2: DELETE FROM t1, t2 USING table_refs [WHERE]
- PostgreSQL: DELETE FROM [ONLY] table [[AS] alias] [USING using_list] [WHERE] [RETURNING]
Plan 9 Task 3: Add emit_delete_stmt/emit_delete_using to emitter, add parse_delete() to Parser, replace extract_delete() dispatch with parse_delete() in classifier.
Plan 9 Task 4: 25 new tests covering MySQL (simple, options, ORDER BY/LIMIT, multi-table form 1 and form 2), PostgreSQL (USING, RETURNING, aliases), bulk data-driven tests (14 MySQL + 9 PostgreSQL variants), and round-trip emission tests. All 316 tests pass.
…OPERATION node types
Register INTERSECT and EXCEPT keywords in both MySQL and PostgreSQL keyword tables. Update is_alias_start() blocklist to prevent these keywords from being misinterpreted as implicit aliases.
Implements Pratt-style precedence parsing where INTERSECT binds tighter than UNION/EXCEPT. Handles parenthesized nesting, optional ALL modifier, and trailing ORDER BY/LIMIT on compound results. Returns bare NODE_SELECT_STMT when no set operator is present.
Update parse_select() to use CompoundQueryParser instead of SelectParser
directly. Add emit_compound_query and emit_set_operation to emitter. Handle
parenthesized SELECT at classifier level for queries starting with '('.
Move FLAG_SET_OP_ALL to common.h for shared access.
Create test_compound.cpp with 22 tests covering UNION, UNION ALL, INTERSECT, INTERSECT ALL, EXCEPT, EXCEPT ALL, precedence verification, parenthesized nesting, trailing ORDER BY/LIMIT, and round-trip emission. Add compound_mode flag to SelectParser that prevents eager consumption of ORDER BY/LIMIT/FOR clauses, allowing CompoundQueryParser to claim them as compound-level clauses. Fix parse_operand to consume SELECT keyword when present for subsequent SELECTs in compound expressions.
Implement AST-based and token-level query digest that normalizes SQL queries for rules matching. Literals become ?, IN lists collapse to IN (?), VALUES rows collapse to a single row, aliases are skipped, and keywords are uppercased. Both paths produce a normalized string and 64-bit FNV-1a hash.
- Add EmitMode::DIGEST to Emitter with modified literal/IN/VALUES/alias emission
- Create digest.h with Digest<D> class, DigestResult, FnvHash
- Token-level fallback works for all statement types (Tier 2 included)
- 34 new tests covering literal normalization, IN collapsing, VALUES collapsing, keyword uppercasing, hash consistency, cross-tier consistency, placeholder passthrough, NULL preservation
Plans 7-11 Complete: All Features Implemented
Plan 7: INSERT/REPLACE Deep Parser (closes #5)
Plan 8: UPDATE Deep Parser (closes #6)
Plan 9: DELETE Deep Parser (closes #7)
Plan 10: Compound Queries (closes #8)
Plan 11: Query Digest (closes #9)
Final State
Summary
Implements the foundational layers of a new hand-written recursive descent SQL parser for ProxySQL, replacing the existing Flex/Bison POC.
- StringRef (zero-copy), Dialect enum, StmtType, NodeType, TokenType

Key design decisions
- if constexpr dialect dispatch (zero runtime overhead)
- ParseResult::remaining for multi-statement query support

Performance targets (validated in follow-up)
Closes #3
Test plan
- Builds with -Wall -Wextra
- make -f Makefile.new clean && make -f Makefile.new all