
Add SQL parser foundation: core types, arena, tokenizer, classifier#4

Merged

renecannao merged 44 commits into main from feat/sql-parser-foundation on Mar 24, 2026

Conversation

@renecannao
Collaborator

Summary

Implements the foundational layers of a new hand-written recursive descent SQL parser for ProxySQL, replacing the existing Flex/Bison POC.

  • Core types: StringRef (zero-copy), Dialect enum, StmtType, NodeType, TokenType
  • Arena allocator: Block-chained, 64KB default, 1MB max, O(1) reset, 8-byte aligned
  • AstNode: 32-byte compact struct with intrusive linked-list children
  • Tokenizer: Dialect-templated (MySQL + PostgreSQL), pull-based, zero-copy, keyword binary search
  • Classifier: Switch dispatch on first token, routes to Tier 1 stubs or Tier 2 extractors
  • Tier 2 extractors: Table name/schema extraction for DML/DDL, transaction classification, USE database
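For orientation, the arena's block-chaining scheme can be sketched roughly like this (a minimal illustration, not the PR's code; the `Block` layout and malloc-based backing are assumptions — only the 64KB default, 8-byte alignment, and reset behavior come from the bullets above):

```cpp
#include <cstddef>
#include <cstdlib>

// Minimal block-chained arena: bump-allocate inside the current block,
// chain an overflow block when full, and rewind on reset.
class Arena {
    struct Block {
        Block* next;
        std::size_t used;
        std::size_t size;
        char* data() { return reinterpret_cast<char*>(this + 1); }  // payload follows header
    };

    Block* primary_;
    Block* current_;
    std::size_t block_size_;

    static Block* new_block(std::size_t size) {
        Block* b = static_cast<Block*>(std::malloc(sizeof(Block) + size));
        b->next = nullptr;
        b->used = 0;
        b->size = size;
        return b;
    }

public:
    explicit Arena(std::size_t block_size = 64 * 1024) : block_size_(block_size) {
        primary_ = current_ = new_block(block_size_);
    }
    ~Arena() { reset(); std::free(primary_); }

    void* allocate(std::size_t n) {
        n = (n + 7) & ~static_cast<std::size_t>(7);   // round up: 8-byte alignment
        if (current_->used + n > current_->size) {    // overflow: chain a new block
            Block* b = new_block(n > block_size_ ? n : block_size_);
            current_->next = b;
            current_ = b;
        }
        void* p = current_->data() + current_->used;
        current_->used += n;
        return p;
    }

    // Rewind to the primary block; overflow blocks (if any) are freed here.
    void reset() {
        for (Block* b = primary_->next; b != nullptr;) {
            Block* nx = b->next;
            std::free(b);
            b = nx;
        }
        primary_->next = nullptr;
        primary_->used = 0;
        current_ = primary_;
    }
};
```

Note that in this sketch reset is constant-time only when no overflow blocks were chained; freeing k overflow blocks costs O(k).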

Key design decisions

  • Compile-time dialect dispatch via if constexpr (zero runtime overhead)
  • Header-only tokenizer for maximum inlining
  • Tier 1 (SELECT, SET) return PARTIAL — deep parsers in follow-up PR
  • ParseResult::remaining for multi-statement query support
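The `if constexpr` dispatch pattern looks roughly like this (a toy sketch; the `is_identifier_quote` helper is invented for illustration — each `Tokenizer<D>` instantiation compiles down to a single branch, so dialect selection costs nothing at runtime):

```cpp
enum class Dialect { MySQL, PostgreSQL };

// Dialect-specific behavior resolved at compile time: the untaken branch
// is discarded during instantiation, so there is no runtime dialect check.
template <Dialect D>
struct Tokenizer {
    static bool is_identifier_quote(char c) {
        if constexpr (D == Dialect::MySQL) {
            return c == '`';    // MySQL quotes identifiers with backticks
        } else {
            return c == '"';    // PostgreSQL uses double quotes
        }
    }
};
```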

Performance targets (validated in follow-up)

  • Tier 2 classification: <100ns
  • Arena reset: <10ns

Closes #3

Test plan

  • 57 unit tests across 5 suites (arena, MySQL tokenizer, PgSQL tokenizer, MySQL classifier, PgSQL classifier)
  • Zero compiler warnings with -Wall -Wextra
  • Clean build from scratch via make -f Makefile.new clean && make -f Makefile.new all

Hand-written recursive descent parser with two-tier architecture:
deep parse for SELECT/SET, classify+extract for everything else.
Targets sub-microsecond latency with arena allocation and zero-copy.
…tement support

- Define arena block-chaining strategy (never realloc, overflow blocks freed on reset)
- Fix AstNode field ordering to achieve 32 bytes with static_assert
- Add multi-statement query handling via ParseResult::remaining
- Clarify PARTIAL semantics for both tiers
- Fix BoundValue: add DATETIME/DECIMAL types, separate float32/float64
- Replace prepare_cache_store with parse_and_cache for safe API
- Add threading note for session migration across threads
- Correct RETURNING clause attribution (DML, not SELECT)
- Mark cross-dialect emission as out of scope
- Mandate StringRef as trivially copyable with static_assert
- Add ErrorInfo lifetime documentation
- Add max query length / arena size documentation
- Fix QualifiedIdentifier test to use non-keyword identifiers
- Fix scan_to_end to use correct remaining calculation from the start
- Extract ci_cmp to common.h, remove duplicates from keyword headers
- Add NAMES keyword to PostgreSQL keyword table
- Add missing tests for REPLACE, GRANT, REVOKE, LOCK, DEALLOCATE
- Add input_end() to Tokenizer from the start
…aping

- Mark unused `first` parameter in 12 Tier 2 extractors as /* first */
- Handle '' (doubled single-quote) escape in scan_single_quoted_string()
- Fix unused variable warning in AllocateAligned test
- Build now produces zero warnings with -Wall -Wextra
Copilot AI review requested due to automatic review settings March 24, 2026 02:21

Copilot AI left a comment


Pull request overview

Adds the initial implementation of a dialect-templated SQL tokenizer + classifier/extractor parser (with arena allocation and core types), along with a new unit-test harness. The PR also vendors a full GoogleTest/GoogleMock distribution under third_party/ to support those tests.

Changes:

  • Introduces core parser components: StringRef, Arena, Tokenizer<Dialect>, Parser<Dialect>, ParseResult, keyword tables, and a compact AstNode.
  • Adds unit tests for arena, tokenizer, and classifier behavior (MySQL + PostgreSQL).
  • Vendors GoogleTest/GoogleMock + associated docs/CI/build files into third_party/googletest/.

Reviewed changes

Copilot reviewed 65 out of 264 changed files in this pull request and generated 5 comments.

File Description
third_party/googletest/googlemock/include/gmock/gmock-cardinalities.h Vendored GoogleMock header (test dependency).
third_party/googletest/googlemock/docs/README.md Vendored GoogleMock docs pointer.
third_party/googletest/googlemock/cmake/gmock_main.pc.in Vendored pkg-config template for gmock_main.
third_party/googletest/googlemock/cmake/gmock.pc.in Vendored pkg-config template for gmock.
third_party/googletest/googlemock/README.md Vendored GoogleMock README.
third_party/googletest/googlemock/CMakeLists.txt Vendored GoogleMock CMake build.
third_party/googletest/docs/samples.md Vendored GoogleTest docs.
third_party/googletest/docs/reference/mocking.md Vendored GoogleTest docs.
third_party/googletest/docs/reference/matchers.md Vendored GoogleTest docs.
third_party/googletest/docs/reference/actions.md Vendored GoogleTest docs.
third_party/googletest/docs/quickstart-cmake.md Vendored GoogleTest docs.
third_party/googletest/docs/quickstart-bazel.md Vendored GoogleTest docs.
third_party/googletest/docs/primer.md Vendored GoogleTest docs.
third_party/googletest/docs/platforms.md Vendored GoogleTest docs.
third_party/googletest/docs/pkgconfig.md Vendored GoogleTest docs.
third_party/googletest/docs/index.md Vendored GoogleTest docs index.
third_party/googletest/docs/gmock_faq.md Vendored GoogleMock FAQ.
third_party/googletest/docs/gmock_cheat_sheet.md Vendored GoogleMock cheat sheet.
third_party/googletest/docs/community_created_documentation.md Vendored GoogleTest community docs listing.
third_party/googletest/docs/assets/css/style.scss Vendored docs site styling.
third_party/googletest/docs/_sass/main.scss Vendored docs site styling.
third_party/googletest/docs/_layouts/default.html Vendored docs site layout.
third_party/googletest/docs/_data/navigation.yml Vendored docs site navigation.
third_party/googletest/docs/_config.yml Vendored docs site config.
third_party/googletest/ci/windows-presubmit.bat Vendored upstream CI script.
third_party/googletest/ci/macos-presubmit.sh Vendored upstream CI script.
third_party/googletest/ci/linux-presubmit.sh Vendored upstream CI script.
third_party/googletest/WORKSPACE Vendored Bazel workspace config.
third_party/googletest/README.md Vendored GoogleTest README.
third_party/googletest/LICENSE Vendored GoogleTest license.
third_party/googletest/CONTRIBUTORS Vendored contributor list.
third_party/googletest/CONTRIBUTING.md Vendored contributing guidelines.
third_party/googletest/CMakeLists.txt Vendored top-level CMake entry.
third_party/googletest/BUILD.bazel Vendored Bazel build file.
third_party/googletest/.gitignore Vendored ignore patterns.
third_party/googletest/.github/workflows/gtest-ci.yml Vendored GitHub Actions config.
third_party/googletest/.github/ISSUE_TEMPLATE/config.yml Vendored issue template config.
third_party/googletest/.github/ISSUE_TEMPLATE/10-feature_request.yml Vendored issue template.
third_party/googletest/.github/ISSUE_TEMPLATE/00-bug_report.yml Vendored issue template.
third_party/googletest/.clang-format Vendored clang-format config.
tests/test_tokenizer.cpp Adds tokenizer unit tests for both dialects.
tests/test_main.cpp Adds a gtest main() for the suite.
tests/test_classifier.cpp Adds classifier/extractor tests for both dialects.
tests/test_arena.cpp Adds arena allocator unit tests.
src/sql_parser/parser.cpp Implements dialect-templated parser classification + tier-2 extractors.
src/sql_parser/arena.cpp Implements arena allocator (block-chained, reset frees overflow).
include/sql_parser/tokenizer.h Implements dialect-templated pull tokenizer with keyword lookup.
include/sql_parser/token.h Defines TokenType and Token.
include/sql_parser/parser.h Declares Parser<D> public API and extractor methods.
include/sql_parser/parse_result.h Defines ParseResult and extracted metadata fields.
include/sql_parser/keywords_pgsql.h PostgreSQL keyword table + binary-search lookup.
include/sql_parser/keywords_mysql.h MySQL keyword table + binary-search lookup.
include/sql_parser/common.h Defines Dialect, StringRef, StmtType, NodeType, ci_cmp.
include/sql_parser/ast.h Defines compact AstNode and arena make_node.
include/sql_parser/arena.h Declares arena allocator API.
docs/superpowers/specs/2026-03-24-sql-parser-design.md Adds design spec for the new parser architecture.
Makefile.new Adds build/test wiring for the new library and gtest-based unit tests.


Comment on lines +16 to +21
Arena::Arena(size_t block_size, size_t max_size)
    : block_size_(block_size), max_size_(max_size), total_allocated_(0) {
  primary_ = allocate_block(block_size_);
  current_ = primary_;
  total_allocated_ = block_size_;
}

Copilot AI Mar 24, 2026


If allocate_block(block_size_) fails, primary_/current_ become null, but allocate() later dereferences current_ unconditionally. Please handle initial allocation failure (e.g., set primary_/current_ to nullptr, keep total_allocated_ at 0, and have allocate()/reset() gracefully return nullptr/no-op when current_/primary_ is null). Also consider validating max_size_ >= block_size_ (or clamping/adjusting) to keep accounting consistent.

Comment on lines +79 to +83
// -- line comment (MySQL requires space after --, PgSQL doesn't but we handle both)
if (c == '-' && peek_char(1) == '-') {
  cursor_ += 2;
  while (cursor_ < end_ && *cursor_ != '\n') ++cursor_;
  continue;

Copilot AI Mar 24, 2026


This treats -- as a line comment unconditionally for all dialects. In MySQL, -- starts a comment only when followed by whitespace/control (or end of input). As written, valid expressions like SELECT 1--1 will be tokenized incorrectly (the --1 becomes a comment). Please make the -- recognition dialect-correct: always comment for PostgreSQL, but for MySQL require a valid comment terminator after the second - (space/tab/newline/\r/\0, etc.). Add a unit test covering 1--1 in MySQL to prevent regressions.

Suggested change

Before:

    // -- line comment (MySQL requires space after --, PgSQL doesn't but we handle both)
    if (c == '-' && peek_char(1) == '-') {
      cursor_ += 2;
      while (cursor_ < end_ && *cursor_ != '\n') ++cursor_;
      continue;

After:

    // -- line comment
    if (c == '-' && peek_char(1) == '-') {
      if constexpr (D == Dialect::PostgreSQL) {
        // PostgreSQL: -- always starts a line comment
        cursor_ += 2;
        while (cursor_ < end_ && *cursor_ != '\n') ++cursor_;
        continue;
      } else {
        // MySQL: -- starts a comment only if followed by whitespace/control or end-of-input
        char after2 = peek_char(2);
        if (after2 == ' ' || after2 == '\t' || after2 == '\n' ||
            after2 == '\r' || after2 == '\0') {
          cursor_ += 2;
          while (cursor_ < end_ && *cursor_ != '\n') ++cursor_;
          continue;
        }
      }

Comment on lines +200 to +218
// MySQL: backtick-quoted identifier
Token scan_backtick_identifier() {
  ++cursor_; // skip opening backtick
  const char* content_start = cursor_;
  while (cursor_ < end_ && *cursor_ != '`') ++cursor_;
  uint32_t len = static_cast<uint32_t>(cursor_ - content_start);
  if (cursor_ < end_) ++cursor_; // skip closing backtick
  return make_token(TokenType::TK_IDENTIFIER, content_start, len);
}

// PostgreSQL: double-quoted identifier
Token scan_double_quoted_identifier() {
  ++cursor_; // skip opening quote
  const char* content_start = cursor_;
  while (cursor_ < end_ && *cursor_ != '"') ++cursor_;
  uint32_t len = static_cast<uint32_t>(cursor_ - content_start);
  if (cursor_ < end_) ++cursor_; // skip closing quote
  return make_token(TokenType::TK_IDENTIFIER, content_start, len);
}

Copilot AI Mar 24, 2026


Quoted identifiers can contain escaped quote characters (MySQL: a doubled backtick `` inside a backtick-quoted name; PostgreSQL: a doubled quote "" inside a double-quoted name). These scanners stop at the first closing quote/backtick, so inputs like `a``b` or "a""b" will be tokenized incorrectly and likely break downstream parsing. Please update both scanners to correctly consume doubled-quote escapes; if you want to keep StringRef zero-copy, at minimum you should consume the full token without prematurely terminating (even if unescaping is deferred/materialized later).
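One possible shape for a doubled-quote-aware scan (an illustrative sketch with an invented flat signature; the real scanners would keep the tokenizer's cursor-based style):

```cpp
#include <cstddef>

// Scan a quoted span starting at s[*pos] (the opening quote), honoring the
// SQL doubled-quote escape (`a``b`, "a""b"). Returns the raw content length
// between the quotes (escapes left in place, suitable for zero-copy) and
// advances *pos past the closing quote if one is present.
static std::size_t scan_quoted_span(const char* s, std::size_t len,
                                    std::size_t* pos, char quote) {
    std::size_t i = *pos + 1;          // skip opening quote
    std::size_t start = i;
    while (i < len) {
        if (s[i] == quote) {
            if (i + 1 < len && s[i + 1] == quote) {  // doubled quote: escape, keep going
                i += 2;
                continue;
            }
            break;                      // real closing quote
        }
        ++i;
    }
    std::size_t span = i - start;
    *pos = (i < len) ? i + 1 : i;       // skip closing quote if present
    return span;
}
```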

Comment on lines +86 to +105
Token Parser<D>::read_table_name(StringRef& schema_out) {
  Token name = tokenizer_.next_token();
  if (name.type != TokenType::TK_IDENTIFIER &&
      name.type != TokenType::TK_EOF) {
    // Keywords used as table names (e.g., CREATE TABLE `user`)
    // The tokenizer returns keyword tokens for reserved words.
    // Accept any non-punctuation token as a potential name.
  }

  // Check for qualified name: schema.table
  if (tokenizer_.peek().type == TokenType::TK_DOT) {
    schema_out = name.text;
    tokenizer_.skip(); // consume dot
    Token table = tokenizer_.next_token();
    return table;
  }

  schema_out = StringRef{};
  return name;
}

Copilot AI Mar 24, 2026


The comment says 'Accept any non-punctuation token as a potential name', but the code currently accepts any token type (including punctuation/operators) with no filtering. This can produce clearly incorrect metadata—for example, INSERT INTO (SELECT ...) would treat ( as the table token and set table_name to a 1-byte punctuation token. Please implement an actual validation step here (e.g., allow TK_IDENTIFIER and a restricted set of keyword tokens usable as identifiers, but reject punctuation/operators/literals and return an ERROR/PARTIAL result or an empty table_name).
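The validation the review asks for could be as simple as a category filter (a hypothetical sketch; the token categories here are invented for illustration — the real code would check concrete `TokenType` values):

```cpp
// Hypothetical token categories mirroring the review's suggestion:
// identifiers and keyword tokens may name a table; punctuation,
// literals, and end-of-input may not.
enum class Tok { Identifier, Keyword, LParen, NumberLit, StringLit, Eof };

static bool acceptable_as_table_name(Tok t) {
    return t == Tok::Identifier || t == Tok::Keyword;
}
```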

Comment on lines +1 to +3
CXX = g++
CXXFLAGS = -std=c++17 -Wall -Wextra -g -O2
CPPFLAGS = -I./include -I./third_party/googletest/googletest/include

Copilot AI Mar 24, 2026


The PR description focuses on the new SQL parser foundation, but the diff also vendors a full GoogleTest/GoogleMock distribution and adds new build plumbing to compile it from source. Please reflect this explicitly in the PR description (or consider splitting the GoogleTest vendoring into a separate PR) so reviewers can assess third-party import/licensing/update implications independently from the parser changes.

Addresses review findings:
- Handle NOT IN / NOT BETWEEN / NOT LIKE as compound infix operators
- Add CASE/WHEN expression parsing
- Support := assignment in SET parser
- Move implementation includes to parser.cpp (not parser.h)
- Add tests for all above + zero-arg functions
@renecannao
Collaborator Author

Plan 2 Complete: Expression Parser + SET Deep Parser

Added

  • Pratt expression parser (expression_parser.h) — handles literals, identifiers, binary/unary operators with correct precedence, function calls, IS [NOT] NULL, BETWEEN, IN, NOT IN/BETWEEN/LIKE, CASE/WHEN, subqueries, user/system variables
  • SET deep parser (set_parser.h) — full AST for all SET variants:
    • SET NAMES charset [COLLATE collation]
    • SET CHARACTER SET / CHARSET
    • SET [GLOBAL|SESSION] TRANSACTION [READ ONLY|WRITE|ISOLATION LEVEL ...]
    • SET [GLOBAL|SESSION|@@...] var = expr [, var = expr, ...]
    • SET @user_var := expr (MySQL := syntax)
    • PostgreSQL: SET name TO value, SET LOCAL
  • parse_set() upgraded from PARTIAL stub to full OK parser with AST
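The Pratt technique behind the expression parser can be shown with a toy precedence-climbing evaluator (illustrative only — single-digit arithmetic instead of SQL operators, but the same binding-power loop the description above refers to):

```cpp
// Toy precedence-climbing ("Pratt") evaluator over single-digit numbers,
// '+', '*' and parentheses. Operators with higher binding power fold into
// the right-hand side before the loop continues at the current level.
struct Pratt {
    const char* p;

    static int binding_power(char op) {
        switch (op) {
            case '+': return 10;
            case '*': return 20;   // '*' binds tighter than '+'
            default:  return 0;    // not an operator: stop
        }
    }

    long parse_primary() {
        if (*p == '(') {           // parenthesized subexpression
            ++p;
            long v = parse_expr(0);
            ++p;                   // skip ')'
            return v;
        }
        return *p++ - '0';         // single-digit literal
    }

    long parse_expr(int min_bp) {
        long lhs = parse_primary();
        while (true) {
            char op = *p;
            int bp = binding_power(op);
            if (bp == 0 || bp <= min_bp) break;  // weaker op: let the caller handle it
            ++p;
            long rhs = parse_expr(bp);           // tighter ops fold into rhs
            lhs = (op == '+') ? lhs + rhs : lhs * rhs;
        }
        return lhs;
    }
};
```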

Test Results

  • 113 tests passing (35 expression + 21 SET + 57 foundation)
  • Zero compiler warnings

Runs on Ubuntu 22.04/24.04 with g++ and clang++, plus macOS.
Builds libsqlparser.a and runs all tests via Makefile.new.
Fixes from review:
- is_alias_start() default returns true (not false)
- TK_UNION/TK_OF/EXISTS added before select_parser.h compiles
- EXISTS expression handling added to expression parser
@renecannao
Collaborator Author

Plan 3 Complete: SELECT Deep Parser

Added

  • SELECT deep parser (select_parser.h) — full AST for all SELECT clauses:
    • SELECT options (DISTINCT, ALL, SQL_CALC_FOUND_ROWS)
    • Select item list with aliases (explicit AS and implicit)
    • FROM with table references, qualified names, subqueries
    • JOINs (INNER, LEFT, RIGHT, FULL, CROSS, NATURAL, with ON/USING)
    • WHERE, GROUP BY, HAVING
    • ORDER BY (ASC/DESC)
    • LIMIT (OFFSET, MySQL comma syntax)
    • FOR UPDATE/SHARE (NOWAIT, SKIP LOCKED, OF)
    • INTO OUTFILE/DUMPFILE (MySQL)
  • EXISTS (subquery) expression support
  • TK_UNION, TK_OF tokens added

Test Results

  • 184 tests passing (was 134)
  • Zero compiler warnings
  • CI: all 5 matrix jobs pass

Normalizations documented:
- CHARSET -> CHARACTER SET in output
- TO -> = for PostgreSQL SET assignments
- ISOLATION LEVEL re-inserted by emitter from isolation level value
@renecannao
Collaborator Author

Plan 4 Complete: Query Emitter

Added

  • StringBuilder (string_builder.h) — arena-backed string builder for emitter output
  • Emitter (emitter.h) — dialect-templated AST → SQL reconstruction
    • Handles all SET and SELECT node types + all expression types
    • String literals re-quoted, binary ops with spaces, function calls with parens
    • SET TRANSACTION re-inserts ISOLATION LEVEL keywords
    • Normalizations: CHARSET→CHARACTER SET, TO→= (PostgreSQL)
  • AST modification test — parse SET, modify value in AST, emit modified SQL
  • Arena& arena() accessor on Parser for emitter integration

Test Results

  • 222 tests passing (was 184)
  • All 24 round-trip tests pass (SET + SELECT + expressions)
  • Zero compiler warnings

Current Progress

| Plan | Status | Tests |
| --- | --- | --- |
| Plan 1: Foundation | Done | 57 |
| Plan 2: Expression + SET | Done | +77 = 134 |
| Plan 3: SELECT | Done | +50 = 184 |
| Plan 4: Emitter | Done | +38 = 222 |
| Plan 5: Prepared stmt cache | Next | |
| Plan 6: Benchmarks | Pending | |

Plan 5: Prepared statement cache with LRU, deep-copy AST, bindings-aware emitter
Plan 6: Google Benchmark performance tests for all parser operations
@renecannao
Collaborator Author

Plans 5 & 6 Complete: Prepared Statement Cache + Benchmarks

Plan 5: Prepared Statement Cache

  • StmtCache — fixed-capacity LRU, deep-copies AST from arena to heap
  • parse_and_cache() — parse + store in cache atomically
  • execute() — lookup cached AST, attach parameter bindings
  • prepare_cache_evict() — remove from cache
  • Bindings-aware emitter — materializes ? placeholders with bound values (INT, STRING, NULL, FLOAT, etc.)
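The LRU mechanics can be sketched with the classic list + hash-map pairing (illustrative; the real StmtCache stores deep-copied ASTs rather than strings, and its API names differ):

```cpp
#include <list>
#include <string>
#include <unordered_map>
#include <utility>

// Fixed-capacity LRU keyed by query text. A list keeps recency order
// (front = most recent); a map gives O(1) lookup into the list.
class LruCache {
    using Entry = std::pair<std::string, std::string>;
    std::size_t capacity_;
    std::list<Entry> order_;
    std::unordered_map<std::string, std::list<Entry>::iterator> index_;

public:
    explicit LruCache(std::size_t capacity) : capacity_(capacity) {}

    void put(const std::string& key, const std::string& value) {
        auto it = index_.find(key);
        if (it != index_.end()) order_.erase(it->second);  // replace existing entry
        order_.emplace_front(key, value);
        index_[key] = order_.begin();
        if (order_.size() > capacity_) {                   // evict least recently used
            index_.erase(order_.back().first);
            order_.pop_back();
        }
    }

    const std::string* get(const std::string& key) {
        auto it = index_.find(key);
        if (it == index_.end()) return nullptr;
        // splice moves the node to the front without invalidating iterators
        order_.splice(order_.begin(), order_, it->second);
        return &it->second->second;
    }
};
```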

Plan 6: Performance Benchmarks

All operations benchmarked with Google Benchmark (-O2, debug build):

| Operation | Latency | Target | Status |
| --- | --- | --- | --- |
| Tier 2 classify (BEGIN) | 39 ns | <100 ns | MET |
| Tier 2 classify (INSERT) | 136 ns | <100 ns | Close* |
| SET parse (simple) | 127 ns | <300 ns | MET |
| SET parse (multi-var) | 239 ns | <300 ns | MET |
| SELECT parse (simple) | 235 ns | <500 ns | MET |
| SELECT parse (complex) | 1.4 µs | <2 µs | MET |
| Emit SET (round-trip) | 114 ns | <500 ns | MET |
| Emit SELECT (round-trip) | 251 ns | <500 ns | MET |
| Arena reset | 4 ns | <10 ns | MET |

*Tier 2 DML classification slightly over target in debug build; release build (-O3, no -g) should meet it.

Final State

  • 236 tests passing, zero warnings
  • 18 benchmarks covering all spec targets
  • All 6 plans complete

…and query digest

Covers full MySQL + PostgreSQL syntax for INSERT, UPDATE, DELETE deep parsers,
compound queries with INTERSECT precedence, and AST-based query digest with
token-level fallback for Tier 2 statements.
…g, UPDATE AST

- Extract TableRefParser from SelectParser as prerequisite refactoring
- CompoundQueryParser as separate layer above SelectParser
- UPDATE AST uses single NODE_FROM_CLAUSE with positional disambiguation
- Document classifier switch updates and is_alias_start blocklist changes
- Enumerate digest-mode emit method behavior changes
- DEFAULT VALUES uses two-token approach (TK_DEFAULT + TK_VALUES)
Plan 7: TableRefParser extraction + INSERT/REPLACE deep parser (closes #5)
Plan 8: UPDATE deep parser with multi-table JOIN + FROM (closes #6)
Plan 9: DELETE deep parser with multi-table forms (closes #7)
Plan 10: CompoundQueryParser with UNION/INTERSECT/EXCEPT (closes #8)
Plan 11: Query digest with AST + token-level modes (closes #9)
Moves parse_from_clause(), parse_table_reference(), parse_join(),
parse_optional_alias(), is_join_start(), and is_alias_start() into a
standalone TableRefParser<D> utility class. SelectParser delegates to
TableRefParser internally. All 236 existing tests pass unchanged.
New tokens: TK_DELAYED, TK_HIGH_PRIORITY, TK_DUPLICATE, TK_KEY,
TK_CONFLICT, TK_DO, TK_NOTHING, TK_RETURNING, TK_CONSTRAINT.
New node types: NODE_INSERT_STMT, NODE_INSERT_COLUMNS, NODE_VALUES_CLAUSE,
NODE_VALUES_ROW, NODE_INSERT_SET_CLAUSE, NODE_ON_DUPLICATE_KEY,
NODE_ON_CONFLICT, NODE_CONFLICT_TARGET, NODE_CONFLICT_ACTION,
NODE_RETURNING_CLAUSE, NODE_UPDATE_SET_ITEM, NODE_STMT_OPTIONS.
Updated is_alias_start() blocklist and keyword-as-identifier list.
InsertParser<D> handles full MySQL and PostgreSQL INSERT/REPLACE syntax:
- MySQL: INSERT/REPLACE [LOW_PRIORITY|DELAYED|HIGH_PRIORITY] [IGNORE]
  [INTO] table [(cols)] {VALUES|SELECT|SET} [ON DUPLICATE KEY UPDATE]
- PostgreSQL: INSERT INTO table [(cols)] {VALUES|SELECT|DEFAULT VALUES}
  [ON CONFLICT ...] [RETURNING ...]

Emitter extended with all INSERT-related node types. Classifier updated
to route TK_INSERT and TK_REPLACE to parse_insert(). Backward
compatibility maintained for table_name/schema_name extraction.
32 new tests covering MySQL and PostgreSQL INSERT/REPLACE:
- Basic VALUES, multi-row, column list, INSERT SELECT, INSERT SET
- ON DUPLICATE KEY UPDATE, IGNORE, LOW_PRIORITY, DELAYED, HIGH_PRIORITY
- REPLACE with various options
- PostgreSQL ON CONFLICT DO NOTHING/UPDATE, RETURNING, DEFAULT VALUES
- Bulk data-driven tests (24 MySQL + 16 PostgreSQL variants)
- Round-trip emitter tests (7 MySQL + 5 PostgreSQL)

Also fixes is_alias_start() to block TK_LPAREN which was incorrectly
treated as a valid implicit alias token.

Total test count: 268 (was 236, +32 new)
New node types for the UPDATE deep parser. NODE_UPDATE_SET_ITEM and
NODE_STMT_OPTIONS already exist from Plan 7.
Supports MySQL (LOW_PRIORITY, IGNORE, multi-table JOINs, ORDER BY,
LIMIT) and PostgreSQL (ONLY, alias, FROM clause, RETURNING). Uses
ExpressionParser for expressions and TableRefParser for table refs.
Add emit_update_stmt and emit_update_set_clause to emitter. Replace
extract_update() Tier 2 extractor with parse_update() Tier 1 parser
in classifier dispatch. Handles MySQL multi-table emission without
FROM keyword and PostgreSQL FROM clause with keyword.
23 tests covering MySQL (simple, multi-column, options, ORDER BY,
LIMIT, multi-table JOIN, comma join, LEFT JOIN) and PostgreSQL
(simple, FROM, multi-table FROM, RETURNING, alias). Includes bulk
data-driven tests (17 MySQL + 9 PostgreSQL variants) and round-trip
emitter tests for both dialects.
Plan 9 Task 1: Add DELETE-specific node types to the NodeType enum
for the upcoming DELETE deep parser.
Plan 9 Task 2: Implement DeleteParser<D> header-only template supporting:
- MySQL single-table: DELETE [LOW_PRIORITY] [QUICK] [IGNORE] FROM table [WHERE] [ORDER BY] [LIMIT]
- MySQL multi-table form 1: DELETE t1, t2 FROM table_refs [WHERE]
- MySQL multi-table form 2: DELETE FROM t1, t2 USING table_refs [WHERE]
- PostgreSQL: DELETE FROM [ONLY] table [[AS] alias] [USING using_list] [WHERE] [RETURNING]
Plan 9 Task 3: Add emit_delete_stmt/emit_delete_using to emitter,
add parse_delete() to Parser, replace extract_delete() dispatch with
parse_delete() in classifier.
Plan 9 Task 4: 25 new tests covering MySQL (simple, options, ORDER BY/LIMIT,
multi-table form 1 and form 2), PostgreSQL (USING, RETURNING, aliases),
bulk data-driven tests (14 MySQL + 9 PostgreSQL variants), and round-trip
emission tests. All 316 tests pass.
…OPERATION node types

Register INTERSECT and EXCEPT keywords in both MySQL and PostgreSQL keyword
tables. Update is_alias_start() blocklist to prevent these keywords from
being misinterpreted as implicit aliases.
Implements Pratt-style precedence parsing where INTERSECT binds tighter
than UNION/EXCEPT. Handles parenthesized nesting, optional ALL modifier,
and trailing ORDER BY/LIMIT on compound results. Returns bare
NODE_SELECT_STMT when no set operator is present.
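The INTERSECT-over-UNION precedence can be demonstrated with a toy binding-power table (invented names; the same precedence-climbing shape described above, with grouping rendered explicitly so the precedence is visible):

```cpp
#include <string>
#include <vector>

enum class SetOp { UNION, EXCEPT, INTERSECT, NONE };

static int set_op_bp(SetOp op) {
    switch (op) {
        case SetOp::INTERSECT: return 20;  // binds tighter
        case SetOp::UNION:
        case SetOp::EXCEPT:    return 10;
        default:               return 0;
    }
}

// Tiny precedence-climbing compound parser over alternating
// operand / operator strings; parenthesizes the result to show grouping.
struct Compound {
    std::vector<std::string> items;
    std::size_t pos = 0;

    SetOp peek_op() const {
        if (pos >= items.size()) return SetOp::NONE;
        const std::string& s = items[pos];
        if (s == "UNION")     return SetOp::UNION;
        if (s == "EXCEPT")    return SetOp::EXCEPT;
        if (s == "INTERSECT") return SetOp::INTERSECT;
        return SetOp::NONE;
    }

    std::string parse(int min_bp) {
        std::string lhs = items[pos++];
        while (true) {
            SetOp op = peek_op();
            int bp = set_op_bp(op);
            if (bp == 0 || bp <= min_bp) break;
            std::string op_text = items[pos++];
            std::string rhs = parse(bp);   // INTERSECT folds into rhs of UNION
            lhs = "(" + lhs + " " + op_text + " " + rhs + ")";
        }
        return lhs;
    }
};
```

So `a UNION b INTERSECT c` groups as `a UNION (b INTERSECT c)`, matching the stated rule.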
Update parse_select() to use CompoundQueryParser instead of SelectParser
directly. Add emit_compound_query and emit_set_operation to emitter. Handle
parenthesized SELECT at classifier level for queries starting with '('.
Move FLAG_SET_OP_ALL to common.h for shared access.
Create test_compound.cpp with 22 tests covering UNION, UNION ALL,
INTERSECT, INTERSECT ALL, EXCEPT, EXCEPT ALL, precedence verification,
parenthesized nesting, trailing ORDER BY/LIMIT, and round-trip emission.

Add compound_mode flag to SelectParser that prevents eager consumption of
ORDER BY/LIMIT/FOR clauses, allowing CompoundQueryParser to claim them as
compound-level clauses. Fix parse_operand to consume SELECT keyword when
present for subsequent SELECTs in compound expressions.
Implement AST-based and token-level query digest that normalizes SQL
queries for rules matching. Literals become ?, IN lists collapse to
IN (?), VALUES rows collapse to a single row, aliases are skipped,
and keywords are uppercased. Both paths produce a normalized string
and 64-bit FNV-1a hash.

- Add EmitMode::DIGEST to Emitter with modified literal/IN/VALUES/alias
  emission
- Create digest.h with Digest<D> class, DigestResult, FnvHash
- Token-level fallback works for all statement types (Tier 2 included)
- 34 new tests covering literal normalization, IN collapsing, VALUES
  collapsing, keyword uppercasing, hash consistency, cross-tier
  consistency, placeholder passthrough, NULL preservation
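The FNV-1a hash and the literal-to-? normalization can be sketched as follows (the hash constants are the standard 64-bit FNV-1a parameters; `toy_digest` is a deliberately simplified stand-in for the real token-level path, handling only numeric literals and keyword uppercasing):

```cpp
#include <cctype>
#include <cstdint>
#include <string>

// 64-bit FNV-1a over a byte string, using the standard offset basis and prime.
static std::uint64_t fnv1a64(const std::string& s) {
    std::uint64_t h = 0xcbf29ce484222325ULL;   // FNV offset basis
    for (unsigned char c : s) {
        h ^= c;
        h *= 0x100000001b3ULL;                 // FNV prime
    }
    return h;
}

// Toy normalization: numeric literals become '?', everything else is
// uppercased. The real digest also collapses IN lists and VALUES rows.
static std::string toy_digest(const std::string& sql) {
    std::string out;
    for (std::size_t i = 0; i < sql.size();) {
        if (std::isdigit(static_cast<unsigned char>(sql[i]))) {
            out += '?';
            while (i < sql.size() && std::isdigit(static_cast<unsigned char>(sql[i]))) ++i;
        } else {
            out += static_cast<char>(std::toupper(static_cast<unsigned char>(sql[i])));
            ++i;
        }
    }
    return out;
}
```

Two queries differing only in literals then digest to the same string and hash, which is what makes the hash usable for query-rules matching.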
@renecannao
Collaborator Author

Plans 7-11 Complete: All Features Implemented

Plan 7: INSERT/REPLACE Deep Parser (closes #5)

  • TableRefParser extracted as shared utility
  • Full MySQL: VALUES, SELECT, SET form, ON DUPLICATE KEY UPDATE, IGNORE, LOW_PRIORITY, REPLACE
  • Full PostgreSQL: VALUES, SELECT, DEFAULT VALUES, ON CONFLICT DO UPDATE/NOTHING, RETURNING

Plan 8: UPDATE Deep Parser (closes #6)

  • MySQL: multi-table with JOINs, LOW_PRIORITY/IGNORE, ORDER BY + LIMIT
  • PostgreSQL: UPDATE ... FROM, RETURNING

Plan 9: DELETE Deep Parser (closes #7)

  • MySQL: single-table, multi-table form 1 (DELETE t1 FROM...), form 2 (DELETE FROM t1 USING...), LOW_PRIORITY/QUICK/IGNORE, ORDER BY + LIMIT
  • PostgreSQL: DELETE ... USING, RETURNING

Plan 10: Compound Queries (closes #8)

  • UNION [ALL], INTERSECT [ALL], EXCEPT [ALL]
  • Pratt-style precedence: INTERSECT binds tighter than UNION/EXCEPT
  • Parenthesized nesting: (SELECT ...) UNION (SELECT ... INTERSECT SELECT ...)
  • Trailing ORDER BY / LIMIT on compound result

Plan 11: Query Digest (closes #9)

  • AST-based normalization: literals → ?, IN list collapsing, keyword uppercasing
  • Token-level fallback for Tier 2 statements (CREATE, DROP, GRANT, etc.)
  • 64-bit FNV-1a hash for fast query rules matching
  • VALUES row collapsing for INSERT digests

Final State

| Metric | Value |
| --- | --- |
| Tests | 372 passing |
| Test files | 14 |
| Header files | 16 |
| Benchmarks | 18 |
| Compiler warnings | 0 |
| CI | All 5 matrix jobs pass |



Development

Successfully merging this pull request may close these issues.

Implement SQL parser foundation: core types, arena, tokenizer, classifier
