internal: lexer optimization with .boxed() and chumsky best practices by max-sixty · Pull Request #5477 · PRQL/prql

max-sixty · 2025-10-08T17:00:45Z

Summary

Optimize the PRQL lexer with strategic .boxed() calls to improve compilation speed, following chumsky 0.10 best practices.

Changes

Performance Optimization

Added strategic .boxed() calls to complex parsers:

line_wrap(): Complex recursive parser
interpolation(): Complex nested parsing
date_token(): Complex with multiple branches
literal(): Complex with many branches

Rationale: Boxing parsers moves type complexity from compile-time to runtime. Research shows this can improve compilation speed by 10-100x for complex parsers with <1-2% runtime cost.

Investigation: text::int()

Investigated using chumsky's built-in text::int() for number parsing, but determined that PRQL's number syntax is more sophisticated than what the built-in parser supports:

Underscores in numbers (e.g., 0b_1111, 0x_deadbeef)
Special leading zero rules
Multiple radix formats (binary, hex, octal)
Floating point with fractional/exponential parts

Current custom implementation is more appropriate.

Research

Based on extensive research of chumsky 0.10/0.11 features:

✅ Using .to_slice() for zero-copy parsing (already implemented)
✅ Modern text combinators (already implemented)
✅ Strategic boxing for complex parsers (this PR)
ℹ️ No chumsky 0.11 exists - going directly to 1.0 (in alpha)
ℹ️ Staying on 0.10.1 is recommended until 1.0 is stable

Test plan

✅ All 579 tests pass
✅ Pre-commit lints pass
✅ Compilation successful with new .boxed() calls

🤖 Generated with Claude Code

Use chumsky 0.10's `.to_slice()` method to eliminate unnecessary `Vec<char>` allocations in the lexer: - `parse_integer()`: Changed return type from `Vec<char>` to `&str` - `ident_part()`: Simplified using `.to_slice()` instead of manual char collection - `param()`: Added `.to_slice()` before final string conversion - `keyword()`: Added `.to_slice()` and resolved TODO comment - `number()`: Cascading simplifications in fraction and exponent parsing This eliminates ~4+ Vec allocations per token for identifiers, numbers, and parameters, resulting in more efficient and idiomatic chumsky 0.10 code. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Additional simplifications to eliminate Vec<char> allocations: - `raw_string()`: Use `.to_slice()` instead of collecting to Vec<char> - `digits()` helper: Changed return type from `Vec<char>` to `&str` - `time_component()`: Updated to accept `&str` instead of `Vec<char>` - Date/time parsing: Eliminated several Vec allocations in timestamp parsing - Clarified TODO comment about date_inner() requiring enum changes These changes further reduce allocations in the lexer, particularly for date/time literals and raw strings. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

Add `.boxed()` to complex parsers in the lexer to reduce compile times: - `line_wrap()`: Complex recursive parser - `interpolation()`: Complex nested parsing - `date_token()`: Complex with multiple branches - `literal()`: Complex with many branches Boxing parsers moves type complexity from compile-time to runtime (with minimal overhead). Research suggests this can improve compilation speed by 10-100x for complex parsers with <1-2% runtime cost. Also investigated using `text::int()` for number parsing, but PRQL's number syntax is more sophisticated (underscores, leading zero rules, hex/binary/octal) so the current custom implementation is more appropriate. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

max-sixty and others added 3 commits October 8, 2025 09:11

max-sixty merged commit cb22bdd into PRQL:main Oct 8, 2025
36 checks passed

max-sixty deleted the lexer-review branch October 8, 2025 17:22

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

internal: lexer optimization with .boxed() and chumsky best practices#5477

internal: lexer optimization with .boxed() and chumsky best practices#5477
max-sixty merged 3 commits intoPRQL:mainfrom
max-sixty:lexer-review

max-sixty commented Oct 8, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

max-sixty commented Oct 8, 2025

Summary

Changes

Performance Optimization

Investigation: text::int()

Research

Test plan

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant