Add SQL parser foundation: core types, arena, tokenizer, classifier#4
renecannao merged 44 commits into main from
Conversation
Hand-written recursive descent parser with two-tier architecture: deep parse for SELECT/SET, classify+extract for everything else. Targets sub-microsecond latency with arena allocation and zero-copy.
…tement support
- Define arena block-chaining strategy (never realloc, overflow blocks freed on reset)
- Fix AstNode field ordering to achieve 32 bytes with static_assert
- Add multi-statement query handling via ParseResult::remaining
- Clarify PARTIAL semantics for both tiers
- Fix BoundValue: add DATETIME/DECIMAL types, separate float32/float64
- Replace prepare_cache_store with parse_and_cache for safe API
- Add threading note for session migration across threads
- Correct RETURNING clause attribution (DML, not SELECT)
- Mark cross-dialect emission as out of scope
- Mandate StringRef as trivially copyable with static_assert
- Add ErrorInfo lifetime documentation
- Add max query length / arena size documentation
- Fix QualifiedIdentifier test to use non-keyword identifiers
- Fix scan_to_end to use correct remaining calculation from the start
- Extract ci_cmp to common.h, remove duplicates from keyword headers
- Add NAMES keyword to PostgreSQL keyword table
- Add missing tests for REPLACE, GRANT, REVOKE, LOCK, DEALLOCATE
- Add input_end() to Tokenizer from the start
…aping
- Mark unused `first` parameter in 12 Tier 2 extractors as /* first */
- Handle '' (doubled single-quote) escape in scan_single_quoted_string()
- Fix unused variable warning in AllocateAligned test
- Build now produces zero warnings with -Wall -Wextra
Pull request overview
Adds the initial implementation of a dialect-templated SQL tokenizer + classifier/extractor parser (with arena allocation and core types), along with a new unit-test harness. The PR also vendors a full GoogleTest/GoogleMock distribution under third_party/ to support those tests.
Changes:
- Introduces core parser components:
StringRef, Arena, Tokenizer<Dialect>, Parser<Dialect>, ParseResult, keyword tables, and a compact AstNode.
- Adds unit tests for arena, tokenizer, and classifier behavior (MySQL + PostgreSQL).
- Vendors GoogleTest/GoogleMock + associated docs/CI/build files into
third_party/googletest/.
Reviewed changes
Copilot reviewed 65 out of 264 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| third_party/googletest/googlemock/include/gmock/gmock-cardinalities.h | Vendored GoogleMock header (test dependency). |
| third_party/googletest/googlemock/docs/README.md | Vendored GoogleMock docs pointer. |
| third_party/googletest/googlemock/cmake/gmock_main.pc.in | Vendored pkg-config template for gmock_main. |
| third_party/googletest/googlemock/cmake/gmock.pc.in | Vendored pkg-config template for gmock. |
| third_party/googletest/googlemock/README.md | Vendored GoogleMock README. |
| third_party/googletest/googlemock/CMakeLists.txt | Vendored GoogleMock CMake build. |
| third_party/googletest/docs/samples.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/reference/mocking.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/reference/matchers.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/reference/actions.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/quickstart-cmake.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/quickstart-bazel.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/primer.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/platforms.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/pkgconfig.md | Vendored GoogleTest docs. |
| third_party/googletest/docs/index.md | Vendored GoogleTest docs index. |
| third_party/googletest/docs/gmock_faq.md | Vendored GoogleMock FAQ. |
| third_party/googletest/docs/gmock_cheat_sheet.md | Vendored GoogleMock cheat sheet. |
| third_party/googletest/docs/community_created_documentation.md | Vendored GoogleTest community docs listing. |
| third_party/googletest/docs/assets/css/style.scss | Vendored docs site styling. |
| third_party/googletest/docs/_sass/main.scss | Vendored docs site styling. |
| third_party/googletest/docs/_layouts/default.html | Vendored docs site layout. |
| third_party/googletest/docs/_data/navigation.yml | Vendored docs site navigation. |
| third_party/googletest/docs/_config.yml | Vendored docs site config. |
| third_party/googletest/ci/windows-presubmit.bat | Vendored upstream CI script. |
| third_party/googletest/ci/macos-presubmit.sh | Vendored upstream CI script. |
| third_party/googletest/ci/linux-presubmit.sh | Vendored upstream CI script. |
| third_party/googletest/WORKSPACE | Vendored Bazel workspace config. |
| third_party/googletest/README.md | Vendored GoogleTest README. |
| third_party/googletest/LICENSE | Vendored GoogleTest license. |
| third_party/googletest/CONTRIBUTORS | Vendored contributor list. |
| third_party/googletest/CONTRIBUTING.md | Vendored contributing guidelines. |
| third_party/googletest/CMakeLists.txt | Vendored top-level CMake entry. |
| third_party/googletest/BUILD.bazel | Vendored Bazel build file. |
| third_party/googletest/.gitignore | Vendored ignore patterns. |
| third_party/googletest/.github/workflows/gtest-ci.yml | Vendored GitHub Actions config. |
| third_party/googletest/.github/ISSUE_TEMPLATE/config.yml | Vendored issue template config. |
| third_party/googletest/.github/ISSUE_TEMPLATE/10-feature_request.yml | Vendored issue template. |
| third_party/googletest/.github/ISSUE_TEMPLATE/00-bug_report.yml | Vendored issue template. |
| third_party/googletest/.clang-format | Vendored clang-format config. |
| tests/test_tokenizer.cpp | Adds tokenizer unit tests for both dialects. |
| tests/test_main.cpp | Adds a gtest main() for the suite. |
| tests/test_classifier.cpp | Adds classifier/extractor tests for both dialects. |
| tests/test_arena.cpp | Adds arena allocator unit tests. |
| src/sql_parser/parser.cpp | Implements dialect-templated parser classification + tier-2 extractors. |
| src/sql_parser/arena.cpp | Implements arena allocator (block-chained, reset frees overflow). |
| include/sql_parser/tokenizer.h | Implements dialect-templated pull tokenizer with keyword lookup. |
| include/sql_parser/token.h | Defines TokenType and Token. |
| include/sql_parser/parser.h | Declares Parser<D> public API and extractor methods. |
| include/sql_parser/parse_result.h | Defines ParseResult and extracted metadata fields. |
| include/sql_parser/keywords_pgsql.h | PostgreSQL keyword table + binary-search lookup. |
| include/sql_parser/keywords_mysql.h | MySQL keyword table + binary-search lookup. |
| include/sql_parser/common.h | Defines Dialect, StringRef, StmtType, NodeType, ci_cmp. |
| include/sql_parser/ast.h | Defines compact AstNode and arena make_node. |
| include/sql_parser/arena.h | Declares arena allocator API. |
| docs/superpowers/specs/2026-03-24-sql-parser-design.md | Adds design spec for the new parser architecture. |
| Makefile.new | Adds build/test wiring for the new library and gtest-based unit tests. |
```cpp
Arena::Arena(size_t block_size, size_t max_size)
    : block_size_(block_size), max_size_(max_size), total_allocated_(0) {
  primary_ = allocate_block(block_size_);
  current_ = primary_;
  total_allocated_ = block_size_;
}
```
If allocate_block(block_size_) fails, primary_/current_ become null, but allocate() later dereferences current_ unconditionally. Please handle initial allocation failure (e.g., set primary_/current_ to nullptr, keep total_allocated_ at 0, and have allocate()/reset() gracefully return nullptr/no-op when current_/primary_ is null). Also consider validating max_size_ >= block_size_ (or clamping/adjusting) to keep accounting consistent.
```cpp
// -- line comment (MySQL requires space after --, PgSQL doesn't but we handle both)
if (c == '-' && peek_char(1) == '-') {
    cursor_ += 2;
    while (cursor_ < end_ && *cursor_ != '\n') ++cursor_;
    continue;
```
This treats -- as a line comment unconditionally for all dialects. In MySQL, -- starts a comment only when followed by whitespace/control (or end of input). As written, valid expressions like SELECT 1--1 will be tokenized incorrectly (the --1 becomes a comment). Please make the -- recognition dialect-correct: always comment for PostgreSQL, but for MySQL require a valid comment terminator after the second - (space/tab/newline/\r/\0, etc.). Add a unit test covering 1--1 in MySQL to prevent regressions.
Suggested change:

```cpp
// -- line comment
if (c == '-' && peek_char(1) == '-') {
    if constexpr (D == Dialect::PostgreSQL) {
        // PostgreSQL: -- always starts a line comment
        cursor_ += 2;
        while (cursor_ < end_ && *cursor_ != '\n') ++cursor_;
        continue;
    } else {
        // MySQL: -- starts a comment only if followed by whitespace/control or end-of-input
        char after2 = peek_char(2);
        if (after2 == ' ' || after2 == '\t' || after2 == '\n' ||
            after2 == '\r' || after2 == '\0') {
            cursor_ += 2;
            while (cursor_ < end_ && *cursor_ != '\n') ++cursor_;
            continue;
        }
    }
```
```cpp
// MySQL: backtick-quoted identifier
Token scan_backtick_identifier() {
    ++cursor_;  // skip opening backtick
    const char* content_start = cursor_;
    while (cursor_ < end_ && *cursor_ != '`') ++cursor_;
    uint32_t len = static_cast<uint32_t>(cursor_ - content_start);
    if (cursor_ < end_) ++cursor_;  // skip closing backtick
    return make_token(TokenType::TK_IDENTIFIER, content_start, len);
}

// PostgreSQL: double-quoted identifier
Token scan_double_quoted_identifier() {
    ++cursor_;  // skip opening quote
    const char* content_start = cursor_;
    while (cursor_ < end_ && *cursor_ != '"') ++cursor_;
    uint32_t len = static_cast<uint32_t>(cursor_ - content_start);
    if (cursor_ < end_) ++cursor_;  // skip closing quote
    return make_token(TokenType::TK_IDENTIFIER, content_start, len);
}
```
Quoted identifiers can contain escaped quote characters (MySQL: a doubled backtick inside a backtick-quoted identifier; PostgreSQL: a doubled quote "" inside a double-quoted identifier). These scanners stop at the first closing quote/backtick, so inputs like ``a``b`` (backtick-quoted) or "a""b" will be tokenized incorrectly and likely break downstream parsing. Please update both scanners to correctly consume doubled-quote escapes; if you want to keep StringRef zero-copy, at minimum you should consume the full token without prematurely terminating (even if unescaping is deferred/materialized later).
```cpp
Token Parser<D>::read_table_name(StringRef& schema_out) {
    Token name = tokenizer_.next_token();
    if (name.type != TokenType::TK_IDENTIFIER &&
        name.type != TokenType::TK_EOF) {
        // Keywords used as table names (e.g., CREATE TABLE `user`)
        // The tokenizer returns keyword tokens for reserved words.
        // Accept any non-punctuation token as a potential name.
    }

    // Check for qualified name: schema.table
    if (tokenizer_.peek().type == TokenType::TK_DOT) {
        schema_out = name.text;
        tokenizer_.skip();  // consume dot
        Token table = tokenizer_.next_token();
        return table;
    }

    schema_out = StringRef{};
    return name;
}
```
The comment says 'Accept any non-punctuation token as a potential name', but the code currently accepts any token type (including punctuation/operators) with no filtering. This can produce clearly incorrect metadata—for example, INSERT INTO (SELECT ...) would treat ( as the table token and set table_name to a 1-byte punctuation token. Please implement an actual validation step here (e.g., allow TK_IDENTIFIER and a restricted set of keyword tokens usable as identifiers, but reject punctuation/operators/literals and return an ERROR/PARTIAL result or an empty table_name).
```makefile
CXX = g++
CXXFLAGS = -std=c++17 -Wall -Wextra -g -O2
CPPFLAGS = -I./include -I./third_party/googletest/googletest/include
```
The PR description focuses on the new SQL parser foundation, but the diff also vendors a full GoogleTest/GoogleMock distribution and adds new build plumbing to compile it from source. Please reflect this explicitly in the PR description (or consider splitting the GoogleTest vendoring into a separate PR) so reviewers can assess third-party import/licensing/update implications independently from the parser changes.
Addresses review findings:
- Handle NOT IN / NOT BETWEEN / NOT LIKE as compound infix operators
- Add CASE/WHEN expression parsing
- Support := assignment in SET parser
- Move implementation includes to parser.cpp (not parser.h)
- Add tests for all above + zero-arg functions
Plan 2 Complete: Expression Parser + SET Deep Parser
Added
Test Results
Runs on Ubuntu 22.04/24.04 with g++ and clang++, plus macOS. Builds libsqlparser.a and runs all tests via Makefile.new.
Fixes from review:
- is_alias_start() default returns true (not false)
- TK_UNION/TK_OF/EXISTS added before select_parser.h compiles
- EXISTS expression handling added to expression parser
…BY, LIMIT, locking
Plan 3 Complete: SELECT Deep Parser
Added
Test Results
Normalizations documented:
- CHARSET -> CHARACTER SET in output
- TO -> = for PostgreSQL SET assignments
- ISOLATION LEVEL re-inserted by emitter from isolation level value
Plan 4 Complete: Query Emitter
Added
Test Results
Current Progress
Plan 5: Prepared statement cache with LRU, deep-copy AST, bindings-aware emitter
Plan 6: Google Benchmark performance tests for all parser operations
… bindings-aware emitter
Plans 5 & 6 Complete: Prepared Statement Cache + Benchmarks
Plan 5: Prepared Statement Cache
Plan 6: Performance Benchmarks
All operations benchmarked with Google Benchmark (-O2, debug build):
*Tier 2 DML classification slightly over target in debug build; release build (

Final State
…and query digest
Covers full MySQL + PostgreSQL syntax for INSERT, UPDATE, DELETE deep parsers, compound queries with INTERSECT precedence, and AST-based query digest with token-level fallback for Tier 2 statements.
…g, UPDATE AST
- Extract TableRefParser from SelectParser as prerequisite refactoring
- CompoundQueryParser as separate layer above SelectParser
- UPDATE AST uses single NODE_FROM_CLAUSE with positional disambiguation
- Document classifier switch updates and is_alias_start blocklist changes
- Enumerate digest-mode emit method behavior changes
- DEFAULT VALUES uses two-token approach (TK_DEFAULT + TK_VALUES)
Plan 7: TableRefParser extraction + INSERT/REPLACE deep parser (closes #5)
Plan 8: UPDATE deep parser with multi-table JOIN + FROM (closes #6)
Plan 9: DELETE deep parser with multi-table forms (closes #7)
Plan 10: CompoundQueryParser with UNION/INTERSECT/EXCEPT (closes #8)
Plan 11: Query digest with AST + token-level modes (closes #9)
Moves parse_from_clause(), parse_table_reference(), parse_join(), parse_optional_alias(), is_join_start(), and is_alias_start() into a standalone TableRefParser<D> utility class. SelectParser delegates to TableRefParser internally. All 236 existing tests pass unchanged.
New tokens: TK_DELAYED, TK_HIGH_PRIORITY, TK_DUPLICATE, TK_KEY, TK_CONFLICT, TK_DO, TK_NOTHING, TK_RETURNING, TK_CONSTRAINT. New node types: NODE_INSERT_STMT, NODE_INSERT_COLUMNS, NODE_VALUES_CLAUSE, NODE_VALUES_ROW, NODE_INSERT_SET_CLAUSE, NODE_ON_DUPLICATE_KEY, NODE_ON_CONFLICT, NODE_CONFLICT_TARGET, NODE_CONFLICT_ACTION, NODE_RETURNING_CLAUSE, NODE_UPDATE_SET_ITEM, NODE_STMT_OPTIONS. Updated is_alias_start() blocklist and keyword-as-identifier list.
InsertParser<D> handles full MySQL and PostgreSQL INSERT/REPLACE syntax:
- MySQL: INSERT/REPLACE [LOW_PRIORITY|DELAYED|HIGH_PRIORITY] [IGNORE]
[INTO] table [(cols)] {VALUES|SELECT|SET} [ON DUPLICATE KEY UPDATE]
- PostgreSQL: INSERT INTO table [(cols)] {VALUES|SELECT|DEFAULT VALUES}
[ON CONFLICT ...] [RETURNING ...]
Emitter extended with all INSERT-related node types. Classifier updated
to route TK_INSERT and TK_REPLACE to parse_insert(). Backward
compatibility maintained for table_name/schema_name extraction.
32 new tests covering MySQL and PostgreSQL INSERT/REPLACE:
- Basic VALUES, multi-row, column list, INSERT SELECT, INSERT SET
- ON DUPLICATE KEY UPDATE, IGNORE, LOW_PRIORITY, DELAYED, HIGH_PRIORITY
- REPLACE with various options
- PostgreSQL ON CONFLICT DO NOTHING/UPDATE, RETURNING, DEFAULT VALUES
- Bulk data-driven tests (24 MySQL + 16 PostgreSQL variants)
- Round-trip emitter tests (7 MySQL + 5 PostgreSQL)

Also fixes is_alias_start() to block TK_LPAREN, which was incorrectly treated as a valid implicit alias token. Total test count: 268 (was 236, +32 new)
New node types for the UPDATE deep parser. NODE_UPDATE_SET_ITEM and NODE_STMT_OPTIONS already exist from Plan 7.
Supports MySQL (LOW_PRIORITY, IGNORE, multi-table JOINs, ORDER BY, LIMIT) and PostgreSQL (ONLY, alias, FROM clause, RETURNING). Uses ExpressionParser for expressions and TableRefParser for table refs.
Add emit_update_stmt and emit_update_set_clause to emitter. Replace extract_update() Tier 2 extractor with parse_update() Tier 1 parser in classifier dispatch. Handles MySQL multi-table emission without FROM keyword and PostgreSQL FROM clause with keyword.
23 tests covering MySQL (simple, multi-column, options, ORDER BY, LIMIT, multi-table JOIN, comma join, LEFT JOIN) and PostgreSQL (simple, FROM, multi-table FROM, RETURNING, alias). Includes bulk data-driven tests (17 MySQL + 9 PostgreSQL variants) and round-trip emitter tests for both dialects.
Plan 9 Task 1: Add DELETE-specific node types to the NodeType enum for the upcoming DELETE deep parser.
Plan 9 Task 2: Implement DeleteParser<D> header-only template supporting:
- MySQL single-table: DELETE [LOW_PRIORITY] [QUICK] [IGNORE] FROM table [WHERE] [ORDER BY] [LIMIT]
- MySQL multi-table form 1: DELETE t1, t2 FROM table_refs [WHERE]
- MySQL multi-table form 2: DELETE FROM t1, t2 USING table_refs [WHERE]
- PostgreSQL: DELETE FROM [ONLY] table [[AS] alias] [USING using_list] [WHERE] [RETURNING]
Plan 9 Task 3: Add emit_delete_stmt/emit_delete_using to emitter, add parse_delete() to Parser, replace extract_delete() dispatch with parse_delete() in classifier.
Plan 9 Task 4: 25 new tests covering MySQL (simple, options, ORDER BY/LIMIT, multi-table form 1 and form 2), PostgreSQL (USING, RETURNING, aliases), bulk data-driven tests (14 MySQL + 9 PostgreSQL variants), and round-trip emission tests. All 316 tests pass.
…OPERATION node types
Register INTERSECT and EXCEPT keywords in both MySQL and PostgreSQL keyword tables. Update is_alias_start() blocklist to prevent these keywords from being misinterpreted as implicit aliases.
Implements Pratt-style precedence parsing where INTERSECT binds tighter than UNION/EXCEPT. Handles parenthesized nesting, optional ALL modifier, and trailing ORDER BY/LIMIT on compound results. Returns bare NODE_SELECT_STMT when no set operator is present.
Update parse_select() to use CompoundQueryParser instead of SelectParser
directly. Add emit_compound_query and emit_set_operation to emitter. Handle
parenthesized SELECT at classifier level for queries starting with '('.
Move FLAG_SET_OP_ALL to common.h for shared access.
Create test_compound.cpp with 22 tests covering UNION, UNION ALL, INTERSECT, INTERSECT ALL, EXCEPT, EXCEPT ALL, precedence verification, parenthesized nesting, trailing ORDER BY/LIMIT, and round-trip emission. Add compound_mode flag to SelectParser that prevents eager consumption of ORDER BY/LIMIT/FOR clauses, allowing CompoundQueryParser to claim them as compound-level clauses. Fix parse_operand to consume SELECT keyword when present for subsequent SELECTs in compound expressions.
Implement AST-based and token-level query digest that normalizes SQL queries for rules matching. Literals become ?, IN lists collapse to IN (?), VALUES rows collapse to a single row, aliases are skipped, and keywords are uppercased. Both paths produce a normalized string and 64-bit FNV-1a hash.
- Add EmitMode::DIGEST to Emitter with modified literal/IN/VALUES/alias emission
- Create digest.h with Digest<D> class, DigestResult, FnvHash
- Token-level fallback works for all statement types (Tier 2 included)
- 34 new tests covering literal normalization, IN collapsing, VALUES collapsing, keyword uppercasing, hash consistency, cross-tier consistency, placeholder passthrough, NULL preservation
Plans 7-11 Complete: All Features Implemented
Plan 7: INSERT/REPLACE Deep Parser (closes #5)
Plan 8: UPDATE Deep Parser (closes #6)
Plan 9: DELETE Deep Parser (closes #7)
Plan 10: Compound Queries (closes #8)
Plan 11: Query Digest (closes #9)
Final State
Summary
Implements the foundational layers of a new hand-written recursive descent SQL parser for ProxySQL, replacing the existing Flex/Bison POC.
- StringRef (zero-copy), Dialect enum, StmtType, NodeType, TokenType

Key design decisions
- if constexpr dialect dispatch (zero runtime overhead)
- ParseResult::remaining for multi-statement query support

Performance targets (validated in follow-up)
Closes #3
Test plan
- Builds with -Wall -Wextra
- make -f Makefile.new clean && make -f Makefile.new all