
feat(pkg-py): add PolarsLazySource for Polars LazyFrame support#191

Merged

cpsievert merged 25 commits into main from feat/py-lazy-frames on Jan 15, 2026

Conversation

Contributor

@cpsievert cpsievert commented Jan 14, 2026

Summary

  • Adds PolarsLazySource DataSource implementation for Polars LazyFrames
  • Data stays lazy until the render boundary, enabling efficient handling of large datasets
  • Uses Polars' SQLContext for native lazy SQL execution
  • Adds AnyFrame type alias to support both DataFrame and LazyFrame return types

Usage

If you're using the Quick start .app(), just drop in a LazyFrame the same way you would a DataFrame.

import polars as pl
from querychat import QueryChat

# Scan a large parquet file lazily
lf = pl.scan_parquet("large_data.parquet")

# Pass directly to QueryChat - stays lazy until rendered
qc = QueryChat(data_source=lf, table_name="data")
qc.app()

If you're using .df() to build a custom app, note that it returns the filtered LazyFrame, so call .collect() wherever you need materialized data.

import polars as pl
from shiny.express import render, ui
from querychat import QueryChat

# Scan lazily - data stays on disk
lf = pl.scan_parquet("large_data.parquet")
qc = QueryChat(data_source=lf, table_name="sales")

# Sidebar with chat interface
with ui.sidebar():
    qc.ui()

# Main panel with data table
@render.data_frame
def filtered_data():
    # df() returns a LazyFrame - collect at render time
    return qc.df().collect()

Changes

  • _datasource.py: Add AnyFrame type alias, widen ABC return types, implement PolarsLazySource (see the sketch after this list)
  • _querychat.py: Update normalize_data_source() to detect LazyFrames, collect at render boundary
  • _querychat_module.py: Update ServerValues.df type hint
  • tools.py: Handle LazyFrames in query tool
  • data-sources.qmd: Add documentation for LazyFrame support
  • Tests: 15 new tests for PolarsLazySource, 1 integration test
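
For orientation, here is a minimal sketch of how a PolarsLazySource built on Polars' SQLContext can execute SQL without collecting the frame. The class name and the SQLContext approach come from this PR; the constructor signature and method bodies below are illustrative only (get_schema and test_query are omitted).

```python
import polars as pl


class PolarsLazySource:
    """Minimal sketch: run SQL against a LazyFrame without collecting it."""

    def __init__(self, lf: pl.LazyFrame, table_name: str):
        self._lf = lf
        self._table_name = table_name
        # Register the LazyFrame under table_name; SQLContext.execute()
        # returns LazyFrames by default, so nothing is materialized here.
        self._ctx = pl.SQLContext(frames={table_name: lf})

    def execute_query(self, query: str) -> pl.LazyFrame:
        # Still lazy: no data is read from disk until the caller .collect()s
        return self._ctx.execute(query)

    def get_data(self) -> pl.LazyFrame:
        return self._lf
```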

Performance

With a 100K row test dataset:

| Operation       | Eager | Lazy    | Speedup |
|-----------------|-------|---------|---------|
| Load time       | 0.02s | 0.0001s | 200x    |
| QueryChat init  | 0.26s | 0.06s   | 4x      |
| Query execution | 0.50s | 0.009s  | 55x     |

Demo script for manual verification

Run with:

export OPENAI_API_KEY="your-key"
cd pkg-py
uv run python examples/lazy_frame_demo.py --rows 10000000
#!/usr/bin/env python3
"""
Demo script comparing eager vs lazy data source performance.

This script demonstrates the performance benefits of using PolarsLazySource
with large datasets. It creates a synthetic dataset and compares:
1. Eager loading (all data in memory upfront)
2. Lazy loading (data stays on disk until needed)

Usage:
    # Set your API key first
    export OPENAI_API_KEY="your-key-here"

    # Run the demo
    cd pkg-py
    uv run python examples/lazy_frame_demo.py

    # Or with a custom number of rows (default: 10 million)
    uv run python examples/lazy_frame_demo.py --rows 50000000
"""

import argparse
import os
import tempfile
import time
from pathlib import Path

import polars as pl


def create_large_dataset(path: Path, n_rows: int) -> None:
    """Create a large parquet file for testing."""
    print(f"Creating dataset with {n_rows:,} rows...")
    start = time.perf_counter()

    # Generate data in chunks to avoid memory issues
    chunk_size = 1_000_000
    chunks_written = 0

    for i in range(0, n_rows, chunk_size):
        chunk_rows = min(chunk_size, n_rows - i)
        chunk = pl.DataFrame(
            {
                "id": range(i, i + chunk_rows),
                "category": [f"cat_{j % 100}" for j in range(chunk_rows)],
                "region": [["North", "South", "East", "West"][j % 4] for j in range(chunk_rows)],
                "value": [float(j % 1000) + 0.5 for j in range(chunk_rows)],
                "quantity": [j % 500 for j in range(chunk_rows)],
                "date": pl.Series([f"2024-{(j % 12) + 1:02d}-{(j % 28) + 1:02d}" for j in range(chunk_rows)]).str.to_date(),
            }
        )

        if chunks_written == 0:
            chunk.write_parquet(path)
        else:
            # Parquet files can't be appended to in place, so re-read the file
            # and rewrite it with the new chunk concatenated. Simple, but it
            # pulls the existing data back into memory on each pass.
            existing = pl.read_parquet(path)
            pl.concat([existing, chunk]).write_parquet(path)

        chunks_written += 1
        print(f"  Written {min(i + chunk_size, n_rows):,} / {n_rows:,} rows")

    elapsed = time.perf_counter() - start
    file_size_mb = path.stat().st_size / (1024 * 1024)
    print(f"Dataset created: {file_size_mb:.1f} MB in {elapsed:.1f}s\n")


def measure_memory() -> float:
    """Get current memory usage in MB (approximate)."""
    import psutil
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 * 1024)


def demo_eager_vs_lazy(parquet_path: Path) -> None:
    """Compare eager vs lazy data loading performance."""
    from querychat import QueryChat

    print("=" * 60)
    print("COMPARING EAGER VS LAZY DATA SOURCE")
    print("=" * 60)

    try:
        import psutil
        has_psutil = True
    except ImportError:
        has_psutil = False
        print("(Install psutil for memory usage tracking: pip install psutil)\n")

    # --- EAGER LOADING ---
    print("\n1. EAGER LOADING (polars.read_parquet → DataFrame)")
    print("-" * 50)

    if has_psutil:
        mem_before = measure_memory()

    start = time.perf_counter()
    df = pl.read_parquet(parquet_path)
    load_time = time.perf_counter() - start

    if has_psutil:
        mem_after = measure_memory()
        print(f"   Memory increase: {mem_after - mem_before:.1f} MB")

    print(f"   Load time: {load_time:.2f}s")
    print(f"   Rows loaded: {len(df):,}")

    start = time.perf_counter()
    qc_eager = QueryChat(data_source=df, table_name="sales", greeting="Hello!")
    init_time = time.perf_counter() - start
    print(f"   QueryChat init: {init_time:.2f}s")

    start = time.perf_counter()
    result = qc_eager.data_source.execute_query(
        "SELECT region, SUM(value) as total FROM sales GROUP BY region"
    )
    query_time = time.perf_counter() - start
    print(f"   Query execution: {query_time:.3f}s")
    print(f"   Result rows: {len(result)}")

    del df, qc_eager, result
    import gc
    gc.collect()

    # --- LAZY LOADING ---
    print("\n2. LAZY LOADING (polars.scan_parquet → LazyFrame)")
    print("-" * 50)

    if has_psutil:
        mem_before = measure_memory()

    start = time.perf_counter()
    lf = pl.scan_parquet(parquet_path)
    load_time = time.perf_counter() - start

    if has_psutil:
        mem_after = measure_memory()
        print(f"   Memory increase: {mem_after - mem_before:.1f} MB")

    print(f"   'Load' time: {load_time:.4f}s (just metadata!)")

    start = time.perf_counter()
    qc_lazy = QueryChat(data_source=lf, table_name="sales", greeting="Hello!")
    init_time = time.perf_counter() - start
    print(f"   QueryChat init: {init_time:.2f}s")

    start = time.perf_counter()
    result_lazy = qc_lazy.data_source.execute_query(
        "SELECT region, SUM(value) as total FROM sales GROUP BY region"
    )
    query_time = time.perf_counter() - start
    print(f"   Query execution (lazy): {query_time:.3f}s")

    start = time.perf_counter()
    result_collected = result_lazy.collect()
    collect_time = time.perf_counter() - start
    print(f"   Collect time: {collect_time:.3f}s")
    print(f"   Result rows: {len(result_collected)}")

    print("\n" + "=" * 60)
    print("SUMMARY: Lazy loading is dramatically faster for large datasets!")
    print("=" * 60)


def main():
    parser = argparse.ArgumentParser(description="Demo lazy vs eager data loading")
    parser.add_argument("--rows", type=int, default=10_000_000)
    parser.add_argument("--interactive", action="store_true")
    parser.add_argument("--data-path", type=str, default=None)
    args = parser.parse_args()

    if args.data_path:
        parquet_path = Path(args.data_path)
    else:
        temp_dir = tempfile.mkdtemp()
        parquet_path = Path(temp_dir) / "large_sales_data.parquet"
        create_large_dataset(parquet_path, args.rows)

    try:
        demo_eager_vs_lazy(parquet_path)
    finally:
        if not args.data_path and parquet_path.exists():
            parquet_path.unlink()
            parquet_path.parent.rmdir()


if __name__ == "__main__":
    main()

Sample output (100K rows):

============================================================
COMPARING EAGER VS LAZY DATA SOURCE
============================================================

1. EAGER LOADING (polars.read_parquet → DataFrame)
--------------------------------------------------
   Load time: 0.02s
   Rows loaded: 100,000
   QueryChat init: 0.26s
   Query execution: 0.495s
   Result rows: 4

2. LAZY LOADING (polars.scan_parquet → LazyFrame)
--------------------------------------------------
   'Load' time: 0.0001s (just metadata!)
   QueryChat init: 0.06s
   Query execution (lazy): 0.009s
   Collect time: 0.007s
   Result rows: 4

Test plan

  • All 131 tests pass
  • Pyright passes with 0 errors on source files
  • New PolarsLazySource tests cover init, execute_query, get_data, get_schema, test_query (a representative test is sketched after this list)
  • Integration test verifies QueryChat accepts LazyFrame and creates PolarsLazySource
  • Demo script validates performance benefits
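
A representative shape for those tests, as a hedged sketch: the import path, constructor signature, and the assumption that execute_query returns a native pl.LazyFrame are all guesses, since only the class name is stated in this PR.

```python
import polars as pl

from querychat._datasource import PolarsLazySource  # import path assumed


def test_execute_query_stays_lazy(tmp_path):
    # Write a tiny parquet file and scan it lazily
    path = tmp_path / "data.parquet"
    pl.DataFrame({"region": ["North", "South"], "value": [1.0, 2.0]}).write_parquet(path)
    source = PolarsLazySource(pl.scan_parquet(path), table_name="sales")

    result = source.execute_query(
        "SELECT region, SUM(value) AS total FROM sales GROUP BY region"
    )
    assert isinstance(result, pl.LazyFrame)  # result is still lazy
    assert result.collect().height == 2      # and materializes on demand
```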

🤖 Generated with Claude Code

@cpsievert cpsievert force-pushed the feat/py-lazy-frames branch from bb84cdd to 438c92e on January 14, 2026 16:45
@cpsievert cpsievert marked this pull request as draft January 14, 2026 16:49
cpsievert and others added 3 commits January 14, 2026 10:51
Add a new DataSource implementation that keeps Polars LazyFrames lazy
until the render boundary. Key changes:

- Add `AnyFrame` type alias (`Union[nw.DataFrame, nw.LazyFrame]`)
- Widen DataSource ABC return types to support lazy frames
- Implement `PolarsLazySource` using Polars SQLContext for lazy SQL
- Update `normalize_data_source()` to detect and route LazyFrames
- Collect LazyFrames at render boundary in `app()` method
- Update type hints throughout

Usage:
```python
import polars as pl
from querychat import QueryChat

lf = pl.scan_parquet("large_data.parquet")
qc = QueryChat(data_source=lf, table_name="data")
```

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PolarsLazySource._polars_dtype_to_sql was mapping pl.Time to "TIMESTAMP"
but it should map to "TIME". Time-only values are not timestamps.

Also added noqa comment for PLR0911 (too many return statements) since
the function now has 7 return statements after the fix.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
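
The mapping function itself isn't shown in this thread; below is a hedged reconstruction of where the corrected branch might sit. The other branches are guesses, but the function ends up with the seven return statements the commit mentions.

```python
import polars as pl


def _polars_dtype_to_sql(dtype: pl.DataType) -> str:  # noqa: PLR0911
    """Illustrative dtype-to-SQL mapping; only the Time branch is from the fix above."""
    if dtype.is_integer():
        return "INTEGER"
    if dtype.is_float():
        return "FLOAT"
    if dtype == pl.Boolean:
        return "BOOLEAN"
    if dtype == pl.Date:
        return "DATE"
    if dtype == pl.Time:
        return "TIME"  # was "TIMESTAMP" before this commit
    if dtype == pl.Datetime:
        return "TIMESTAMP"
    return "TEXT"
```
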
@cpsievert cpsievert force-pushed the feat/py-lazy-frames branch from 438c92e to e5555b3 on January 14, 2026 16:51
cpsievert and others added 3 commits January 14, 2026 10:56
Previously, test_query only validated schema structure via collect_schema()
without executing the query. This meant runtime errors (e.g., invalid casts)
wouldn't surface until actual collection.

Now test_query collects one row to catch runtime errors, matching the behavior
of DataFrameSource.test_query. The return type changes from LazyFrame to
DataFrame since we've already done the work.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
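
A sketch of the described behavior, written as a standalone function for illustration (in the package it is a method on PolarsLazySource; the narwhals wrapping is an assumption based on the ABC's nw.DataFrame return type):

```python
import narwhals as nw
import polars as pl


def test_query(ctx: pl.SQLContext, query: str) -> nw.DataFrame:
    """Validate a query by actually collecting one row of its result."""
    lazy_result = ctx.execute(query)
    # head(1) keeps the probe cheap; collect() forces execution and raises
    # on runtime errors such as invalid casts.
    return nw.from_native(lazy_result.head(1).collect())
```
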
The noqa: A005 comment was accidentally removed from types/__init__.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cpsievert cpsievert requested a review from Copilot January 15, 2026 00:20


- Remove lazy_frame_demo.py example script
- Fix empty LazyFrame handling in get_schema to prevent .row() failure
- Add .head() limit when collecting unique values to reduce memory usage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cpsievert and others added 10 commits January 15, 2026 09:20
The abstract test_query() method was declared to return AnyFrame but all
concrete implementations return nw.DataFrame. This is intentional since
test_query collects data to catch runtime errors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…LazyFrame support

- Batch categorical value collection into single scan using implode()
  instead of N separate scans (one per categorical column)
- Extract _get_categorical_values() helper method for clarity
- Rename AnyFrame to LazyOrDataFrame for better readability
- Store native Polars LazyFrame internally instead of narwhals wrapper
- Simplify df_to_html() implementation
- Improve error messages for unsupported LazyFrame backends

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
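
A hedged sketch of the implode() batching described in the first bullet above (the helper name matches the commit; the exact value cap is illustrative):

```python
import polars as pl


def _get_categorical_values(
    lf: pl.LazyFrame, categorical_cols: list[str], limit: int = 50
) -> dict[str, list]:
    """Gather unique values for every categorical column in a single scan.

    implode() packs each column's uniques into one list element, so all
    selected columns have length 1 and can be read with row(0, named=True).
    """
    exprs = [pl.col(c).unique().head(limit).implode() for c in categorical_cols]
    row = lf.select(exprs).collect().row(0, named=True)
    return {col: list(values) for col, values in row.items()}
```
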
…eta dataclass

Introduce a ColumnMeta dataclass to consolidate column metadata into a
single data structure, replacing multiple parallel lists and dicts.

Changes:
- Add ColumnMeta dataclass with name, sql_type, kind, min_val, max_val, categories
- Refactor get_schema() into three clear steps: classify, add stats, format
- Extract static helper methods: _make_column_meta, _add_column_stats, _format_schema
- Use .row(0, named=True) consistently for extracting aggregate results
- Fix test to check native LazyFrame identity instead of wrapper identity

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
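
The field names below come from the commit message; the types and example kind values are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ColumnMeta:
    """Per-column metadata gathered by get_schema() (sketch)."""

    name: str
    sql_type: str
    kind: str  # e.g. "numeric" vs. "categorical" (illustrative values)
    min_val: Optional[Any] = None
    max_val: Optional[Any] = None
    categories: Optional[list[str]] = None
```
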
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update DataFrameSource to require narwhals DataFrame as input, removing
implicit conversion from raw pandas/polars DataFrames. Update all tests
to wrap DataFrames with nw.from_native().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace specific benchmark numbers with qualitative explanation of lazy
evaluation benefits (deferred loading, query optimization, reduced memory).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…Source

Remove conditional that skipped range output when both min/max were None,
matching DataFrameSource behavior of always showing range info.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflicts and integrate LazyFrame support with new multi-framework
architecture:
- Update _querychat_base.py with PolarsLazySource support in normalize_data_source
- Add LazyFrame handling in _shiny.py (collect before render)
- Update _shiny_module.py with LazyOrDataFrame type
- Keep GT-based df_to_html in _utils.py
- Combine dev dependencies (polars) with new docs dependencies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update polars tests to wrap DataFrames with nw.from_native()
- Fix df_to_html test to match actual truncation message format

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cpsievert cpsievert requested a review from Copilot January 15, 2026 19:57


- Update StateDictAccessorMixin.df() to return LazyOrDataFrame
- Update AppState.get_current_data() to return LazyOrDataFrame
- Update StreamlitQueryChat.df() to return LazyOrDataFrame

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@abstractmethod
- def execute_query(self, query: str) -> nw.DataFrame:
+ def execute_query(self, query: str) -> DataOrLazyFrame:
Contributor Author

@cpsievert cpsievert Jan 15, 2026


This will be an annoying thing to work around as a user with type checking enabled.

I have some ideas on how to improve this, but I think it's better to address that in a follow-up PR.

LazyFrame doesn't have .shape or .to_pandas() attributes directly -
these require the frame to be collected first. Added isinstance checks
and collection calls in _dash.py, _gradio.py, and _streamlit.py.

Also fixed test assertion in test_df_to_html.py to match actual
implementation output format.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
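
The pattern this commit describes, as a small sketch (the helper name is made up for illustration; the adapters inline the check):

```python
from typing import Union

import polars as pl


def _ensure_collected(frame: Union[pl.DataFrame, pl.LazyFrame]) -> pl.DataFrame:
    """Collect a LazyFrame before DataFrame-only calls like .shape or .to_pandas()."""
    if isinstance(frame, pl.LazyFrame):
        return frame.collect()
    return frame


# e.g. inside an adapter's render path:
# df = _ensure_collected(qc.df())
# n_rows, n_cols = df.shape
```
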
@cpsievert cpsievert marked this pull request as ready for review January 15, 2026 21:05
@cpsievert cpsievert requested a review from gadenbuie January 15, 2026 21:05
@cpsievert cpsievert merged commit f01820f into main Jan 15, 2026
13 checks passed
@cpsievert cpsievert deleted the feat/py-lazy-frames branch January 15, 2026 21:09