No-polars in ggsql by thomasp85 · Pull Request #350 · posit-dev/ggsql

thomasp85 · 2026-04-22T07:36:46Z

Replace polars with arrow-rs

Why?

polars is the single largest dependency in ggsql — 328 transitive crates — yet it's used almost entirely as a passive data container. The real work happens in SQL (via DuckDB/SQLite), and DuckDB already requires arrow (92 crates). Dropping polars eliminates ~236 transitive crates with no loss of functionality.

Verified dep count: ggsql drops from 418 → 182 transitive crates.

Approach: thin DataFrame wrapper around arrow::RecordBatch

Rather than using RecordBatch directly (immutable, no column-by-name lookup, missing constructors), the PR introduces a thin wrapper that provides the ~12 methods the codebase actually uses. This was the lowest-churn path across ~50 affected files.

Three new modules form the migration foundation:

src/dataframe.rs: The DataFrame wrapper + df! test macro. Wraps RecordBatch, exposes height/width/column/with_column/rename/drop/replace/slice/…
src/array_util.rs: Replaces polars' series.f64() / series.str() with as_f64(array) / as_str(array) downcasts; plus constructors, cast_array, fill_null_f64, value_to_string
src/compute.rs: Grouped window ops for position adjustments: sort_dataframe, compute_group_ids, grouped_cumsum, grouped_cumsum_lag, grouped_sum_broadcast
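To make the wrapper idea concrete, here is a minimal std-only sketch. It assumes a stand-in `Batch` type playing the role of arrow's RecordBatch, with every column simplified to nullable f64. Only the method names height/width/column/with_column come from the description above; the signatures and bodies are hypothetical and are not ggsql's actual src/dataframe.rs.

```rust
use std::sync::Arc;

/// Stand-in for an immutable batch of equal-length columns. It plays the role
/// of arrow's RecordBatch in this sketch; it is NOT the arrow type.
#[derive(Clone, Debug, PartialEq)]
pub struct Batch {
    names: Vec<String>,
    columns: Vec<Arc<Vec<Option<f64>>>>,
    rows: usize,
}

/// Thin wrapper: name-based lookup plus "mutating" helpers that build a new
/// batch instead of changing the existing one in place.
#[derive(Clone, Debug, PartialEq)]
pub struct DataFrame {
    inner: Batch,
}

impl DataFrame {
    /// Hypothetical constructor; rejects columns of unequal length.
    pub fn new(cols: Vec<(&str, Vec<Option<f64>>)>) -> Result<Self, String> {
        let rows = cols.first().map_or(0, |(_, c)| c.len());
        if cols.iter().any(|(_, c)| c.len() != rows) {
            return Err("all columns must have the same length".to_string());
        }
        let (names, columns): (Vec<String>, Vec<Arc<Vec<Option<f64>>>>) = cols
            .into_iter()
            .map(|(n, c)| (n.to_string(), Arc::new(c)))
            .unzip();
        Ok(Self { inner: Batch { names, columns, rows } })
    }

    pub fn height(&self) -> usize {
        self.inner.rows
    }

    pub fn width(&self) -> usize {
        self.inner.columns.len()
    }

    /// Look a column up by name; None if it does not exist.
    pub fn column(&self, name: &str) -> Option<&[Option<f64>]> {
        let idx = self.inner.names.iter().position(|n| n == name)?;
        Some(self.inner.columns[idx].as_slice())
    }

    /// Return a new frame with `name` added, or replaced if already present.
    /// Untouched columns are shared through `Arc`, so the clone is cheap.
    pub fn with_column(&self, name: &str, values: Vec<Option<f64>>) -> Result<Self, String> {
        if self.width() > 0 && values.len() != self.height() {
            return Err(format!(
                "column '{name}' has {} rows, expected {}",
                values.len(),
                self.height()
            ));
        }
        let mut inner = self.inner.clone();
        inner.rows = values.len();
        match inner.names.iter().position(|n| n == name) {
            Some(i) => inner.columns[i] = Arc::new(values),
            None => {
                inner.names.push(name.to_string());
                inner.columns.push(Arc::new(values));
            }
        }
        Ok(Self { inner })
    }
}
```

Calling `with_column` leaves the original frame untouched and hands back a new one, which is how a wrapper can offer a mutable-feeling API on top of an immutable batch.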

The hard part: position adjustments

stack.rs was the only place using polars' lazy API with grouped window functions (cum_sum().over(), shift(), fill_null()). We considered pushing position adjustments into SQL, but scale-type inference happens after query execution, so we'd hit a chicken-and-egg problem. Instead, ~50 lines of polars lazy expressions became ~120 lines of arrow compute calls in stack.rs, using the primitives in compute.rs. dodge.rs and jitter.rs followed the same pattern.
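For readers unfamiliar with the grouped window operations involved, the following std-only sketch shows what the primitives named above compute. The function names are taken from compute.rs as listed in this PR, but the signatures and bodies are assumptions: the real versions operate on arrow arrays, and their null handling may differ from the "null counts as 0.0" rule assumed here.

```rust
use std::collections::HashMap;

/// Assign a dense id (0, 1, 2, ...) to each distinct key, in order of first
/// appearance.
pub fn compute_group_ids(keys: &[&str]) -> Vec<usize> {
    let mut seen: HashMap<&str, usize> = HashMap::new();
    keys.iter()
        .map(|k| {
            let next = seen.len();
            *seen.entry(*k).or_insert(next)
        })
        .collect()
}

/// Running total within each group, including the current row.
/// Assumption of this sketch: a null contributes 0.0.
pub fn grouped_cumsum(values: &[Option<f64>], groups: &[usize]) -> Vec<f64> {
    let mut totals: HashMap<usize, f64> = HashMap::new();
    values
        .iter()
        .zip(groups)
        .map(|(v, g)| {
            let t = totals.entry(*g).or_insert(0.0);
            *t += v.unwrap_or(0.0);
            *t
        })
        .collect()
}

/// Running total of the preceding rows in each group (0.0 for the first row).
/// This corresponds to cum_sum().over() followed by shift() and fill_null(0).
pub fn grouped_cumsum_lag(values: &[Option<f64>], groups: &[usize]) -> Vec<f64> {
    let mut totals: HashMap<usize, f64> = HashMap::new();
    values
        .iter()
        .zip(groups)
        .map(|(v, g)| {
            let t = totals.entry(*g).or_insert(0.0);
            let before = *t;
            *t += v.unwrap_or(0.0);
            before
        })
        .collect()
}

/// Every row receives the total of its group.
pub fn grouped_sum_broadcast(values: &[Option<f64>], groups: &[usize]) -> Vec<f64> {
    let mut totals: HashMap<usize, f64> = HashMap::new();
    for (v, g) in values.iter().zip(groups) {
        *totals.entry(*g).or_insert(0.0) += v.unwrap_or(0.0);
    }
    groups.iter().map(|g| totals[g]).collect()
}
```

With rows grouped by x position, a stacked bar would span from `grouped_cumsum_lag` (lower edge) to `grouped_cumsum` (upper edge), and a fill mode could divide both by `grouped_sum_broadcast` so each stack sums to 1.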

The position-adjustment tests are the primary acceptance criteria here — they encode a lot of tricky numeric behavior (fill/center modes, grouped cumsums, null handling).

Migrations across the codebase

Readers: duckdb.rs, sqlite.rs, odbc.rs — replaced polars::Series builders with arrow array builders. In DuckDB, dataframe_to_arrow_params simplifies to df.inner().clone() since our DataFrame is a RecordBatch.
Parquet reading (reader/data.rs): polars::ParquetReader → parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder. Added parquet as a direct dep.
DataType references (~25 files): mechanical rename — DataType::String → Utf8, Date → Date32, Datetime(µs, tz) → Timestamp(Microsecond, tz), Time → Time64(Nanosecond), Categorical → Utf8.
Writers (vegalite/data.rs, encoding.rs, layer.rs): series downcasts → arrow downcasts with explicit null checks (arrow doesn't auto-skip nulls the way polars' iterators do).
ggsql-wasm: same pattern — polars Series construction → Arc etc. WASM-specific getrandom/uuid feature overrides added for wasm32-unknown-unknown.
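The explicit null checks mentioned for the writers follow from how arrow-style columns are laid out: a dense values buffer plus a separate validity mask. The sketch below models that layout with a toy `F64Column` type (hypothetical, std-only, not an arrow type) to show why a slot must be checked before its value is read.

```rust
/// Toy column with an arrow-like layout: a dense values buffer plus a
/// validity mask. Hypothetical type for illustration only.
pub struct F64Column {
    values: Vec<f64>,
    valid: Vec<bool>,
}

impl F64Column {
    pub fn from_options(items: &[Option<f64>]) -> Self {
        Self {
            // Null slots still occupy space in the buffer; here they hold 0.0.
            values: items.iter().map(|v| v.unwrap_or(0.0)).collect(),
            valid: items.iter().map(|v| v.is_some()).collect(),
        }
    }

    pub fn len(&self) -> usize {
        self.values.len()
    }

    pub fn is_null(&self, i: usize) -> bool {
        !self.valid[i]
    }

    /// Returns whatever sits in the buffer, even for a null slot.
    pub fn value(&self, i: usize) -> f64 {
        self.values[i]
    }
}

/// Serialize each slot, checking validity first so that nulls become "null"
/// rather than a placeholder number.
pub fn to_json_values(col: &F64Column) -> Vec<String> {
    (0..col.len())
        .map(|i| {
            if col.is_null(i) {
                "null".to_string()
            } else {
                col.value(i).to_string()
            }
        })
        .collect()
}
```

Without the `is_null` check, the null slot in this sketch would be written out as 0, which is the kind of silent error an explicit check guards against.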

Gotchas worth reviewer attention

Parquet file regeneration. The bundled penguins.parquet and airquality.parquet were originally written by R's nanoparquet package, which produced an ARROW:schema blob that fails flatbuffers alignment in arrow-rs. They were regenerated with arrow-rs itself. Documentation at the top of reader/data.rs calls out which writers are known compatible (pyarrow/arrow-rs/DuckDB) vs. incompatible (nanoparquet). A new test all_builtin_parquets_load iterates KNOWN_DATASETS so CI catches any future incompatible additions.
Temporal ↔ floating casts. Arrow's compute::cast can't cross the temporal/floating boundary directly: Date32 → Float64 fails, so you have to go via Int32. Rather than special-casing every call site (there are ~15 of them), array_util::cast_array was extended to bridge these conversions transparently via the integer backing type. This surfaced through two user-reported bugs during review (histogram on an Int64 column; boxplot with a Date x-axis + SCALE BINNED).
ggsql-python removed from the monorepo. It now lives in its own repo, so this PR doesn't touch it.
Versioning. All workspace crates now inherit version.workspace = true. pyproject.toml in the (external) Python package is not auto-synced — noted for a future release-script task.
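The temporal ↔ floating bridge described in the gotchas above can be illustrated with a small std-only model. Here `strict_cast` is a stand-in for a cast kernel that refuses Date32 ↔ Float64, and `cast_array` wraps it by routing through Int32. The `Column` and `Kind` enums and both function bodies are assumptions made for this sketch; the real array_util::cast_array works on arrow arrays and covers more types.

```rust
/// Toy column type covering only the three kinds needed for this sketch.
#[derive(Clone, Debug, PartialEq)]
pub enum Column {
    Date32(Vec<Option<i32>>), // days since the Unix epoch
    Int32(Vec<Option<i32>>),
    Float64(Vec<Option<f64>>),
}

#[derive(Clone, Copy, Debug, PartialEq)]
pub enum Kind {
    Date32,
    Int32,
    Float64,
}

/// Stand-in for a strict cast kernel: it will not cross the temporal/floating
/// boundary in one step.
pub fn strict_cast(col: &Column, to: Kind) -> Result<Column, String> {
    match (col, to) {
        (Column::Date32(v), Kind::Date32) => Ok(Column::Date32(v.clone())),
        (Column::Date32(v), Kind::Int32) => Ok(Column::Int32(v.clone())),
        (Column::Int32(v), Kind::Date32) => Ok(Column::Date32(v.clone())),
        (Column::Int32(v), Kind::Int32) => Ok(Column::Int32(v.clone())),
        (Column::Int32(v), Kind::Float64) => {
            Ok(Column::Float64(v.iter().map(|&x| x.map(f64::from)).collect()))
        }
        (Column::Float64(v), Kind::Int32) => {
            // `as` truncates toward zero; acceptable for this illustration.
            Ok(Column::Int32(v.iter().map(|&x| x.map(|f| f as i32)).collect()))
        }
        (Column::Float64(v), Kind::Float64) => Ok(Column::Float64(v.clone())),
        (Column::Date32(_), Kind::Float64) | (Column::Float64(_), Kind::Date32) => {
            Err("no direct cast between Date32 and Float64".to_string())
        }
    }
}

/// Bridge the unsupported pairs via the integer backing type; everything else
/// is passed straight to the strict kernel.
pub fn cast_array(col: &Column, to: Kind) -> Result<Column, String> {
    match (col, to) {
        (Column::Date32(_), Kind::Float64) | (Column::Float64(_), Kind::Date32) => {
            let ints = strict_cast(col, Kind::Int32)?;
            strict_cast(&ints, to)
        }
        _ => strict_cast(col, to),
    }
}
```

Putting the bridge in one helper means call sites keep asking for the target type they want and never need to know about the intermediate integer step.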

Test coverage

1343 unit tests pass, 0 failed, 1 ignored.
All existing position-adjustment tests were preserved and pass unchanged, which was the strictest signal that the arrow rewrite of stack/dodge/jitter is behavior-equivalent.
New tests added for the cases that surfaced during review: cast_array temporal↔floating bridging in array_util, apply_oob_to_column_numeric with Date32 in execute/scale.rs, and the histogram null-error path.

What to look at first as a reviewer

src/dataframe.rs — the API surface everything else depends on. If this is right, the rest of the churn is mechanical.
src/plot/layer/position/stack.rs — the only genuinely non-mechanical rewrite; worth reading against the polars version in main to convince yourself the arrow compute chain is equivalent.
src/array_util.rs::cast_array — the temporal/floating bridge. Subtle but high-leverage because many call sites rely on it.
src/reader/data.rs — parquet compatibility docs + the iterating test that keeps future datasets honest.

PR summary written by Claude

thomasp85 · 2026-04-22T10:01:55Z

/format

github-actions · 2026-04-22T10:02:07Z

❌ /format failed. If this is a fork PR, make sure "Allow edits from maintainers" is enabled.

thomasp85 · 2026-04-22T10:06:00Z

/format

github-actions · 2026-04-22T10:06:21Z

✨ Formatted and pushed.

teunbrand · 2026-04-22T14:44:20Z

I think a few claude.md lines still have mentions of polars

ggsql/CLAUDE.md

Line 127 in d053660

│ (Polars) │ │

ggsql/CLAUDE.md

Line 537 in d053660

- SQL execution → Polars DataFrame conversion

ggsql/CLAUDE.md

Line 1336 in d053660

ResultSet → DataFrame (Polars)

georgestagg

I am 28/57 files in, but I'm posting an initial set of comments now to get the ball rolling.

Also consider the following suggestions, from codex (with a grain of salt and my apologies for the direct LLM copy/paste):

Finding 2 — DataFrame::drop_by_index loses row count
src/dataframe.rs:287-288 — When dropping the last column, it returns Self::empty() which is 0×0. Annotation
layers that have only literal columns can collapse to zero rows, causing marks to disappear silently.

Finding 3 — drop_many swallows errors
src/dataframe.rs:210-217 — Returns Self::empty() both when all columns are dropped (same row-count issue) and
when RecordBatch::try_new fails (silent data loss).
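One way to address both findings, sketched with a std-only stand-in: store the row count explicitly so it survives a frame with zero columns, and return a Result instead of falling back to an empty frame. The `Frame` type and its methods are hypothetical and are not the fix applied in this PR. For real arrow batches, arrow-rs provides RecordBatch::try_new_with_options, whose RecordBatchOptions can carry an explicit row count for the zero-column case.

```rust
/// Hypothetical frame that tracks its height independently of its columns.
#[derive(Clone, Debug, PartialEq)]
pub struct Frame {
    names: Vec<String>,
    columns: Vec<Vec<f64>>,
    height: usize, // stored explicitly so it survives having zero columns
}

impl Frame {
    pub fn new(cols: Vec<(&str, Vec<f64>)>) -> Result<Self, String> {
        let height = cols.first().map_or(0, |(_, c)| c.len());
        if cols.iter().any(|(_, c)| c.len() != height) {
            return Err("all columns must have the same length".to_string());
        }
        let mut names = Vec::new();
        let mut columns = Vec::new();
        for (n, c) in cols {
            names.push(n.to_string());
            columns.push(c);
        }
        Ok(Self { names, columns, height })
    }

    pub fn height(&self) -> usize {
        self.height
    }

    pub fn width(&self) -> usize {
        self.columns.len()
    }

    /// Drop the named columns. Unknown names are reported as errors rather
    /// than swallowed, and the row count is kept even if no columns remain.
    pub fn drop_many(&self, to_drop: &[&str]) -> Result<Self, String> {
        for name in to_drop {
            if !self.names.iter().any(|have| have.as_str() == *name) {
                return Err(format!("unknown column: {name}"));
            }
        }
        let mut names = Vec::new();
        let mut columns = Vec::new();
        for (n, c) in self.names.iter().zip(&self.columns) {
            if !to_drop.contains(&n.as_str()) {
                names.push(n.clone());
                columns.push(c.clone());
            }
        }
        Ok(Self { names, columns, height: self.height })
    }
}
```

A layer whose only column is dropped still reports its original number of rows, so literal-only annotation layers would not collapse to zero marks.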

georgestagg

This all looks reasonable to me, great work! I'm sure I'll have missed something, but this PR is large so let's get it in ASAP and fix any issues with followup PRs.

georgestagg · 2026-04-23T12:27:20Z

Other than clippy, who is needy.

thomasp85 added 9 commits April 22, 2026 08:55

remove polars from main project

39ae468

remove polars from wasm

81803bf

Merge commit '0c8ad3f03ac06f6e288087e64fa73b899c1f7006'

82de7b9

fix wasm and reformat

19d6c60

appease clippy

2a7ec6a

add changelog

6a27d38

reformat again

d58f180

fix histogram stat

df1297a

fix casting of arrays

ab91cc8

style: cargo fmt

ffafbd8

thomasp85 requested a review from georgestagg April 22, 2026 11:08

georgestagg reviewed Apr 22, 2026

Comment thread src/array_util.rs Outdated

Comment thread src/reader/data.rs Outdated

Comment thread src/reader/data.rs

thomasp85 and others added 4 commits April 23, 2026 11:27

remove polars mentions in CLAUDE.md

217c4ac

Apply suggestions from code review

e659749

Co-authored-by: George Stagg <georgestagg@gmail.com>

Merge branch 'no-polar' of https://github.com/posit-dev/ggsql into no…

dbe0978

…-polar

Apply suggestions from Codex

3be0b86

georgestagg approved these changes Apr 23, 2026

Comment thread src/writer/vegalite/layer.rs Outdated

Comment thread src/writer/vegalite/layer.rs Outdated

thomasp85 added 2 commits April 23, 2026 14:51

Fix leakage and refactor

00528d9

reformat

5258a6b

thomasp85 merged commit 5a0fbeb into main Apr 23, 2026
2 checks passed

mkcorneli mentioned this pull request Apr 27, 2026

Add Exasol ODBC dialect #386

Open

4 tasks

thomasp85 merged 16 commits into main from no-polar


3 participants
