georgestagg
left a comment
I am 28/57 files in, but posting an initial set of comments now to get the ball rolling.
Also consider the following suggestions, from codex (with a grain of salt and my apologies for the direct LLM copy/paste):
Finding 2 — DataFrame::drop_by_index loses row count
src/dataframe.rs:287-288 — When dropping the last column, it returns Self::empty() which is 0×0. Annotation layers that have only literal columns can collapse to zero rows, causing marks to disappear silently.
Finding 3 — drop_many swallows errors
src/dataframe.rs:210-217 — Returns Self::empty() both when all columns are dropped (same row-count issue) and when RecordBatch::try_new fails (silent data loss).
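Both findings come down to the wrapper forgetting its row count once the column list is empty, and to construction errors being replaced by an empty frame. The following is a minimal sketch of one possible shape for a fix, not the PR's code: `Frame`, `FrameError`, and `try_new` are hypothetical, std-only stand-ins for the arrow-backed wrapper. The row count is stored explicitly and both drop methods return a `Result`. In arrow-rs itself, a zero-column batch can carry a row count via `RecordBatchOptions::with_row_count` passed to `RecordBatch::try_new_with_options`; whether that fits the wrapper here is an assumption.

```rust
/// Toy stand-in for the DataFrame wrapper (hypothetical, std-only).
/// The row count is stored explicitly so it survives having zero columns.
#[derive(Debug, Clone, PartialEq)]
pub struct Frame {
    names: Vec<String>,
    columns: Vec<Vec<Option<f64>>>,
    num_rows: usize,
}

#[derive(Debug, PartialEq)]
pub enum FrameError {
    LengthMismatch { column: String, expected: usize, found: usize },
    IndexOutOfBounds { index: usize, width: usize },
}

impl Frame {
    /// Checked constructor: every column must have exactly `num_rows` values.
    pub fn try_new(
        names: Vec<String>,
        columns: Vec<Vec<Option<f64>>>,
        num_rows: usize,
    ) -> Result<Self, FrameError> {
        for (name, col) in names.iter().zip(&columns) {
            if col.len() != num_rows {
                return Err(FrameError::LengthMismatch {
                    column: name.clone(),
                    expected: num_rows,
                    found: col.len(),
                });
            }
        }
        Ok(Self { names, columns, num_rows })
    }

    pub fn num_rows(&self) -> usize {
        self.num_rows
    }

    pub fn num_columns(&self) -> usize {
        self.columns.len()
    }

    /// Drops one column; the result keeps the original row count even
    /// when no columns remain.
    pub fn drop_by_index(&self, index: usize) -> Result<Self, FrameError> {
        if index >= self.columns.len() {
            return Err(FrameError::IndexOutOfBounds {
                index,
                width: self.columns.len(),
            });
        }
        let mut names = self.names.clone();
        let mut columns = self.columns.clone();
        names.remove(index);
        columns.remove(index);
        Self::try_new(names, columns, self.num_rows)
    }

    /// Drops every named column and propagates construction errors
    /// instead of falling back to an empty frame.
    pub fn drop_many(&self, to_drop: &[&str]) -> Result<Self, FrameError> {
        let mut names = Vec::new();
        let mut columns = Vec::new();
        for (name, col) in self.names.iter().zip(&self.columns) {
            if !to_drop.contains(&name.as_str()) {
                names.push(name.clone());
                columns.push(col.clone());
            }
        }
        Self::try_new(names, columns, self.num_rows)
    }
}
```

With this shape, dropping the only column of a three-row frame yields a 3×0 frame rather than 0×0, so an annotation layer made only of literal columns would still know how many marks to draw.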
Co-authored-by: George Stagg <georgestagg@gmail.com>
Other than clippy, who is needy.
Replace polars with arrow-rs
Why?
polars is the single largest dependency in ggsql — 328 transitive crates — yet it's used almost entirely as a passive data container. The real work happens in SQL (via DuckDB/SQLite), and DuckDB already requires arrow (92 crates). Dropping polars eliminates ~236 transitive crates with no loss of functionality.
Verified dep count: ggsql drops from 418 → 182 transitive crates.
Approach: thin DataFrame wrapper around arrow::RecordBatch
Rather than using RecordBatch directly (immutable, no column-by-name lookup, missing constructors), the PR introduces a thin wrapper that provides the ~12 methods the codebase actually uses. This was the lowest-churn path across ~50 affected files.
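To illustrate the wrapper idea, here is a rough std-only sketch. `Batch` is a toy stand-in for an immutable record batch, and `ThinFrame`, `column`, and `with_column` are hypothetical names rather than the PR's actual API. The wrapper adds lookup by name and a copy-on-write "mutation" that rebuilds the batch while sharing untouched columns.

```rust
use std::sync::Arc;

/// Toy stand-in for an immutable record batch (hypothetical, std-only).
#[derive(Debug, Clone)]
pub struct Batch {
    names: Vec<String>,
    columns: Vec<Arc<Vec<f64>>>,
}

/// Thin wrapper adding the conveniences the batch itself lacks.
#[derive(Debug, Clone)]
pub struct ThinFrame {
    batch: Batch,
}

impl ThinFrame {
    pub fn empty() -> Self {
        Self {
            batch: Batch { names: Vec::new(), columns: Vec::new() },
        }
    }

    /// Column-by-name lookup.
    pub fn column(&self, name: &str) -> Option<&[f64]> {
        let idx = self.batch.names.iter().position(|n| n == name)?;
        Some(self.batch.columns[idx].as_slice())
    }

    /// Returns a new frame with `name` added or replaced. The underlying
    /// batch is never mutated; untouched columns are shared via `Arc`.
    pub fn with_column(&self, name: &str, values: Vec<f64>) -> Result<Self, String> {
        if let Some(first) = self.batch.columns.first() {
            if first.len() != values.len() {
                return Err(format!(
                    "column '{}' has {} rows, expected {}",
                    name,
                    values.len(),
                    first.len()
                ));
            }
        }
        let mut names = self.batch.names.clone();
        let mut columns = self.batch.columns.clone(); // clones Arc handles only
        match names.iter().position(|n| n == name) {
            Some(i) => columns[i] = Arc::new(values),
            None => {
                names.push(name.to_string());
                columns.push(Arc::new(values));
            }
        }
        Ok(Self { batch: Batch { names, columns } })
    }
}
```

Because every "mutation" returns a new frame, call sites written against a mutable DataFrame API can be ported with small, mechanical changes, which is consistent with the low-churn goal described above.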
Three new modules form the migration foundation:
The hard part: position adjustments
stack.rs was the only place using polars' lazy API with grouped window functions (cum_sum().over(), shift(), fill_null()). We considered pushing position adjustments into SQL, but scale-type inference happens after query execution, so we'd hit a chicken-and-egg problem. Instead, ~50 lines of polars lazy expressions became ~120 lines of arrow compute calls in stack.rs, using the primitives in compute.rs. dodge.rs and jitter.rs followed the same pattern.
The position-adjustment tests are the primary acceptance criteria here — they encode a lot of tricky numeric behavior (fill/center modes, grouped cumsums, null handling).
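For readers unfamiliar with the stacking logic, the grouped cumulative sum, shift, and null fill can be sketched in plain Rust. This is an illustration under stated assumptions, not the code in stack.rs: the function name `stack` is hypothetical, nulls are assumed to be filled with 0, and rows are assumed to be stacked in input order within each group.

```rust
use std::collections::HashMap;

/// Returns (ymin, ymax) for each row, stacking rows that share a group key.
/// Equivalent in spirit to: ymax = cum_sum(fill_null(y, 0)) over group,
/// and ymin = shift(ymax, 1) over group with the leading null filled by 0.
pub fn stack(groups: &[&str], y: &[Option<f64>]) -> Vec<(f64, f64)> {
    assert_eq!(groups.len(), y.len(), "groups and y must have equal length");

    // Running total per group key.
    let mut running: HashMap<&str, f64> = HashMap::new();
    let mut out = Vec::with_capacity(y.len());

    for (group, value) in groups.iter().zip(y) {
        let total = running.entry(*group).or_insert(0.0);
        let ymin = *total; // previous cumulative sum (the "shifted" value)
        *total += value.unwrap_or(0.0); // null contributes nothing
        out.push((ymin, *total));
    }
    out
}
```

A null row produces a zero-height bar at the current top of its stack, and later rows in the same group continue from there. Whether the real implementation keeps such rows or drops them is something the position-adjustment tests would pin down.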
Migrations across the codebase
Gotchas worth reviewer attention
Test coverage
What to look at first as a reviewer