
feat(pkg-py): add PolarsLazySource for Polars LazyFrame support#191

Merged

cpsievert merged 25 commits into main from feat/py-lazy-frames on Jan 15, 2026

Conversation

Contributor

@cpsievert cpsievert commented Jan 14, 2026

Summary

  • Adds PolarsLazySource DataSource implementation for Polars LazyFrames
  • Data stays lazy until the render boundary, enabling efficient handling of large datasets
  • Uses Polars' SQLContext for native lazy SQL execution
  • Adds AnyFrame type alias to support both DataFrame and LazyFrame return types

Usage

If you're using the Quick start .app(), just drop in a LazyFrame the same way you would a DataFrame.

import polars as pl
from querychat import QueryChat

# Scan a large parquet file lazily
lf = pl.scan_parquet("large_data.parquet")

# Pass directly to QueryChat - stays lazy until rendered
qc = QueryChat(data_source=lf, table_name="data")
qc.app()

If you're using .df() to build a custom app, note that it returns the filtered LazyFrame, so call .collect() wherever you need materialized data.

import polars as pl
from shiny.express import render, ui
from querychat import QueryChat

# Scan lazily - data stays on disk
lf = pl.scan_parquet("large_data.parquet")
qc = QueryChat(data_source=lf, table_name="sales")

# Sidebar with chat interface
with ui.sidebar():
    qc.ui()

# Main panel with data table
@render.data_frame
def filtered_data():
    # df() returns a LazyFrame - collect at render time
    return qc.df().collect()

Changes

  • _datasource.py: Add AnyFrame type alias, widen ABC return types, implement PolarsLazySource (see the sketch after this list)
  • _querychat.py: Update normalize_data_source() to detect LazyFrames, collect at render boundary
  • _querychat_module.py: Update ServerValues.df type hint
  • tools.py: Handle LazyFrames in query tool
  • data-sources.qmd: Add documentation for LazyFrame support
  • Tests: 15 new tests for PolarsLazySource, 1 integration test
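
For orientation, here is a minimal sketch of how a PolarsLazySource built on Polars' SQLContext can execute SQL without collecting the frame. The class name and the SQLContext approach come from this PR; the constructor signature and method bodies below are illustrative only (get_schema and test_query are omitted).

```python
import polars as pl


class PolarsLazySource:
    """Minimal sketch: run SQL against a LazyFrame without collecting it."""

    def __init__(self, lf: pl.LazyFrame, table_name: str):
        self._lf = lf
        self._table_name = table_name
        # Register the LazyFrame under table_name; SQLContext.execute()
        # returns LazyFrames by default, so nothing is materialized here.
        self._ctx = pl.SQLContext(frames={table_name: lf})

    def execute_query(self, query: str) -> pl.LazyFrame:
        # Still lazy: no data is read from disk until the caller .collect()s
        return self._ctx.execute(query)

    def get_data(self) -> pl.LazyFrame:
        return self._lf
```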

Performance

With a 100K row test dataset:

| Operation       | Eager | Lazy    | Speedup |
|-----------------|-------|---------|---------|
| Load time       | 0.02s | 0.0001s | 200x    |
| QueryChat init  | 0.26s | 0.06s   | 4x      |
| Query execution | 0.50s | 0.009s  | 55x     |

Demo script for manual verification

Run with:

export OPENAI_API_KEY="your-key"
cd pkg-py
uv run python examples/lazy_frame_demo.py --rows 10000000
#!/usr/bin/env python3
"""
Demo script comparing eager vs lazy data source performance.

This script demonstrates the performance benefits of using PolarsLazySource
with large datasets. It creates a synthetic dataset and compares:
1. Eager loading (all data in memory upfront)
2. Lazy loading (data stays on disk until needed)

Usage:
    # Set your API key first
    export OPENAI_API_KEY="your-key-here"

    # Run the demo
    cd pkg-py
    uv run python examples/lazy_frame_demo.py

    # Or with a custom number of rows (default: 10 million)
    uv run python examples/lazy_frame_demo.py --rows 50000000
"""

import argparse
import os
import tempfile
import time
from pathlib import Path

import polars as pl


def create_large_dataset(path: Path, n_rows: int) -> None:
    """Create a large parquet file for testing."""
    print(f"Creating dataset with {n_rows:,} rows...")
    start = time.perf_counter()

    # Generate data in chunks to avoid memory issues
    chunk_size = 1_000_000
    chunks_written = 0

    for i in range(0, n_rows, chunk_size):
        chunk_rows = min(chunk_size, n_rows - i)
        chunk = pl.DataFrame(
            {
                "id": range(i, i + chunk_rows),
                "category": [f"cat_{j % 100}" for j in range(chunk_rows)],
                "region": [["North", "South", "East", "West"][j % 4] for j in range(chunk_rows)],
                "value": [float(j % 1000) + 0.5 for j in range(chunk_rows)],
                "quantity": [j % 500 for j in range(chunk_rows)],
                "date": pl.Series([f"2024-{(j % 12) + 1:02d}-{(j % 28) + 1:02d}" for j in range(chunk_rows)]).str.to_date(),
            }
        )

        if chunks_written == 0:
            chunk.write_parquet(path)
        else:
            # Parquet files can't be appended to in place, so re-read the file
            # and rewrite it with the new chunk concatenated. Simple, but it
            # pulls the existing data back into memory on each pass.
            existing = pl.read_parquet(path)
            pl.concat([existing, chunk]).write_parquet(path)

        chunks_written += 1
        print(f"  Written {min(i + chunk_size, n_rows):,} / {n_rows:,} rows")

    elapsed = time.perf_counter() - start
    file_size_mb = path.stat().st_size / (1024 * 1024)
    print(f"Dataset created: {file_size_mb:.1f} MB in {elapsed:.1f}s\n")


def measure_memory() -> float:
    """Get current memory usage in MB (approximate)."""
    import psutil
    process = psutil.Process(os.getpid())
    return process.memory_info().rss / (1024 * 1024)


def demo_eager_vs_lazy(parquet_path: Path) -> None:
    """Compare eager vs lazy data loading performance."""
    from querychat import QueryChat

    print("=" * 60)
    print("COMPARING EAGER VS LAZY DATA SOURCE")
    print("=" * 60)

    try:
        import psutil
        has_psutil = True
    except ImportError:
        has_psutil = False
        print("(Install psutil for memory usage tracking: pip install psutil)\n")

    # --- EAGER LOADING ---
    print("\n1. EAGER LOADING (polars.read_parquet → DataFrame)")
    print("-" * 50)

    if has_psutil:
        mem_before = measure_memory()

    start = time.perf_counter()
    df = pl.read_parquet(parquet_path)
    load_time = time.perf_counter() - start

    if has_psutil:
        mem_after = measure_memory()
        print(f"   Memory increase: {mem_after - mem_before:.1f} MB")

    print(f"   Load time: {load_time:.2f}s")
    print(f"   Rows loaded: {len(df):,}")

    start = time.perf_counter()
    qc_eager = QueryChat(data_source=df, table_name="sales", greeting="Hello!")
    init_time = time.perf_counter() - start
    print(f"   QueryChat init: {init_time:.2f}s")

    start = time.perf_counter()
    result = qc_eager.data_source.execute_query(
        "SELECT region, SUM(value) as total FROM sales GROUP BY region"
    )
    query_time = time.perf_counter() - start
    print(f"   Query execution: {query_time:.3f}s")
    print(f"   Result rows: {len(result)}")

    del df, qc_eager, result
    import gc
    gc.collect()

    # --- LAZY LOADING ---
    print("\n2. LAZY LOADING (polars.scan_parquet → LazyFrame)")
    print("-" * 50)

    if has_psutil:
        mem_before = measure_memory()

    start = time.perf_counter()
    lf = pl.scan_parquet(parquet_path)
    load_time = time.perf_counter() - start

    if has_psutil:
        mem_after = measure_memory()
        print(f"   Memory increase: {mem_after - mem_before:.1f} MB")

    print(f"   'Load' time: {load_time:.4f}s (just metadata!)")

    start = time.perf_counter()
    qc_lazy = QueryChat(data_source=lf, table_name="sales", greeting="Hello!")
    init_time = time.perf_counter() - start
    print(f"   QueryChat init: {init_time:.2f}s")

    start = time.perf_counter()
    result_lazy = qc_lazy.data_source.execute_query(
        "SELECT region, SUM(value) as total FROM sales GROUP BY region"
    )
    query_time = time.perf_counter() - start
    print(f"   Query execution (lazy): {query_time:.3f}s")

    start = time.perf_counter()
    result_collected = result_lazy.collect()
    collect_time = time.perf_counter() - start
    print(f"   Collect time: {collect_time:.3f}s")
    print(f"   Result rows: {len(result_collected)}")

    print("\n" + "=" * 60)
    print("SUMMARY: Lazy loading is dramatically faster for large datasets!")
    print("=" * 60)


def main():
    parser = argparse.ArgumentParser(description="Demo lazy vs eager data loading")
    parser.add_argument("--rows", type=int, default=10_000_000)
    parser.add_argument("--interactive", action="store_true")
    parser.add_argument("--data-path", type=str, default=None)
    args = parser.parse_args()

    if args.data_path:
        parquet_path = Path(args.data_path)
    else:
        temp_dir = tempfile.mkdtemp()
        parquet_path = Path(temp_dir) / "large_sales_data.parquet"
        create_large_dataset(parquet_path, args.rows)

    try:
        demo_eager_vs_lazy(parquet_path)
    finally:
        if not args.data_path and parquet_path.exists():
            parquet_path.unlink()
            parquet_path.parent.rmdir()


if __name__ == "__main__":
    main()

Sample output (100K rows):

============================================================
COMPARING EAGER VS LAZY DATA SOURCE
============================================================

1. EAGER LOADING (polars.read_parquet → DataFrame)
--------------------------------------------------
   Load time: 0.02s
   Rows loaded: 100,000
   QueryChat init: 0.26s
   Query execution: 0.495s
   Result rows: 4

2. LAZY LOADING (polars.scan_parquet → LazyFrame)
--------------------------------------------------
   'Load' time: 0.0001s (just metadata!)
   QueryChat init: 0.06s
   Query execution (lazy): 0.009s
   Collect time: 0.007s
   Result rows: 4

Test plan

  • All 131 tests pass
  • Pyright passes with 0 errors on source files
  • New PolarsLazySource tests cover init, execute_query, get_data, get_schema, test_query (a representative test is sketched after this list)
  • Integration test verifies QueryChat accepts LazyFrame and creates PolarsLazySource
  • Demo script validates performance benefits
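
A representative shape for those tests, as a hedged sketch: the import path, constructor signature, and the assumption that execute_query returns a native pl.LazyFrame are all guesses, since only the class name is stated in this PR.

```python
import polars as pl

from querychat._datasource import PolarsLazySource  # import path assumed


def test_execute_query_stays_lazy(tmp_path):
    # Write a tiny parquet file and scan it lazily
    path = tmp_path / "data.parquet"
    pl.DataFrame({"region": ["North", "South"], "value": [1.0, 2.0]}).write_parquet(path)
    source = PolarsLazySource(pl.scan_parquet(path), table_name="sales")

    result = source.execute_query(
        "SELECT region, SUM(value) AS total FROM sales GROUP BY region"
    )
    assert isinstance(result, pl.LazyFrame)  # result is still lazy
    assert result.collect().height == 2      # and materializes on demand
```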

🤖 Generated with Claude Code

@cpsievert cpsievert force-pushed the feat/py-lazy-frames branch from bb84cdd to 438c92e on January 14, 2026 16:45
@cpsievert cpsievert marked this pull request as draft January 14, 2026 16:49
cpsievert and others added 3 commits January 14, 2026 10:51
Add a new DataSource implementation that keeps Polars LazyFrames lazy
until the render boundary. Key changes:

- Add `AnyFrame` type alias (`Union[nw.DataFrame, nw.LazyFrame]`)
- Widen DataSource ABC return types to support lazy frames
- Implement `PolarsLazySource` using Polars SQLContext for lazy SQL
- Update `normalize_data_source()` to detect and route LazyFrames
- Collect LazyFrames at render boundary in `app()` method
- Update type hints throughout

Usage:
```python
import polars as pl
from querychat import QueryChat

lf = pl.scan_parquet("large_data.parquet")
qc = QueryChat(data_source=lf, table_name="data")
```

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PolarsLazySource._polars_dtype_to_sql was mapping pl.Time to "TIMESTAMP"
but it should map to "TIME". Time-only values are not timestamps.

Also added noqa comment for PLR0911 (too many return statements) since
the function now has 7 return statements after the fix.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
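
The mapping function itself isn't shown in this thread; below is a hedged reconstruction of where the corrected branch might sit. The other branches are guesses, but the function ends up with the seven return statements the commit mentions.

```python
import polars as pl


def _polars_dtype_to_sql(dtype: pl.DataType) -> str:  # noqa: PLR0911
    """Illustrative dtype-to-SQL mapping; only the Time branch is from the fix above."""
    if dtype.is_integer():
        return "INTEGER"
    if dtype.is_float():
        return "FLOAT"
    if dtype == pl.Boolean:
        return "BOOLEAN"
    if dtype == pl.Date:
        return "DATE"
    if dtype == pl.Time:
        return "TIME"  # was "TIMESTAMP" before this commit
    if dtype == pl.Datetime:
        return "TIMESTAMP"
    return "TEXT"
```
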
@cpsievert cpsievert force-pushed the feat/py-lazy-frames branch from 438c92e to e5555b3 on January 14, 2026 16:51
cpsievert and others added 3 commits January 14, 2026 10:56
Previously, test_query only validated schema structure via collect_schema()
without executing the query. This meant runtime errors (e.g., invalid casts)
wouldn't surface until actual collection.

Now test_query collects one row to catch runtime errors, matching the behavior
of DataFrameSource.test_query. The return type changes from LazyFrame to
DataFrame since we've already done the work.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
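
A sketch of the described behavior, written as a standalone function for illustration (in the package it is a method on PolarsLazySource; the narwhals wrapping is an assumption based on the ABC's nw.DataFrame return type):

```python
import narwhals as nw
import polars as pl


def test_query(ctx: pl.SQLContext, query: str) -> nw.DataFrame:
    """Validate a query by actually collecting one row of its result."""
    lazy_result = ctx.execute(query)
    # head(1) keeps the probe cheap; collect() forces execution and raises
    # on runtime errors such as invalid casts.
    return nw.from_native(lazy_result.head(1).collect())
```
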
The noqa: A005 comment was accidentally removed from types/__init__.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cpsievert cpsievert requested a review from Copilot January 15, 2026 00:20


- Remove lazy_frame_demo.py example script
- Fix empty LazyFrame handling in get_schema to prevent .row() failure
- Add .head() limit when collecting unique values to reduce memory usage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

cpsievert and others added 10 commits January 15, 2026 09:20
The abstract test_query() method was declared to return AnyFrame but all
concrete implementations return nw.DataFrame. This is intentional since
test_query collects data to catch runtime errors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…LazyFrame support

- Batch categorical value collection into single scan using implode()
  instead of N separate scans (one per categorical column)
- Extract _get_categorical_values() helper method for clarity
- Rename AnyFrame to LazyOrDataFrame for better readability
- Store native Polars LazyFrame internally instead of narwhals wrapper
- Simplify df_to_html() implementation
- Improve error messages for unsupported LazyFrame backends

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
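
A hedged sketch of the implode() batching described in the first bullet above (the helper name matches the commit; the exact value cap is illustrative):

```python
import polars as pl


def _get_categorical_values(
    lf: pl.LazyFrame, categorical_cols: list[str], limit: int = 50
) -> dict[str, list]:
    """Gather unique values for every categorical column in a single scan.

    implode() packs each column's uniques into one list element, so all
    selected columns have length 1 and can be read with row(0, named=True).
    """
    exprs = [pl.col(c).unique().head(limit).implode() for c in categorical_cols]
    row = lf.select(exprs).collect().row(0, named=True)
    return {col: list(values) for col, values in row.items()}
```
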
…eta dataclass

Introduce a ColumnMeta dataclass to consolidate column metadata into a
single data structure, replacing multiple parallel lists and dicts.

Changes:
- Add ColumnMeta dataclass with name, sql_type, kind, min_val, max_val, categories
- Refactor get_schema() into three clear steps: classify, add stats, format
- Extract static helper methods: _make_column_meta, _add_column_stats, _format_schema
- Use .row(0, named=True) consistently for extracting aggregate results
- Fix test to check native LazyFrame identity instead of wrapper identity

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
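
The field names below come from the commit message; the types and example kind values are assumptions.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class ColumnMeta:
    """Per-column metadata gathered by get_schema() (sketch)."""

    name: str
    sql_type: str
    kind: str  # e.g. "numeric" vs. "categorical" (illustrative values)
    min_val: Optional[Any] = None
    max_val: Optional[Any] = None
    categories: Optional[list[str]] = None
```
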
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update DataFrameSource to require narwhals DataFrame as input, removing
implicit conversion from raw pandas/polars DataFrames. Update all tests
to wrap DataFrames with nw.from_native().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace specific benchmark numbers with qualitative explanation of lazy
evaluation benefits (deferred loading, query optimization, reduced memory).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…Source

Remove conditional that skipped range output when both min/max were None,
matching DataFrameSource behavior of always showing range info.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflicts and integrate LazyFrame support with new multi-framework
architecture:
- Update _querychat_base.py with PolarsLazySource support in normalize_data_source
- Add LazyFrame handling in _shiny.py (collect before render)
- Update _shiny_module.py with LazyOrDataFrame type
- Keep GT-based df_to_html in _utils.py
- Combine dev dependencies (polars) with new docs dependencies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Update polars tests to wrap DataFrames with nw.from_native()
- Fix df_to_html test to match actual truncation message format

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cpsievert cpsievert requested a review from Copilot January 15, 2026 19:57


- Update StateDictAccessorMixin.df() to return LazyOrDataFrame
- Update AppState.get_current_data() to return LazyOrDataFrame
- Update StreamlitQueryChat.df() to return LazyOrDataFrame

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

@abstractmethod
- def execute_query(self, query: str) -> nw.DataFrame:
+ def execute_query(self, query: str) -> DataOrLazyFrame:
Contributor Author

@cpsievert cpsievert Jan 15, 2026


This will be an annoying thing to work around as a user with type checking enabled.

I have some ideas on how to improve this, but I think it's better to address that in a follow-up PR.

LazyFrame doesn't have .shape or .to_pandas() attributes directly -
these require the frame to be collected first. Added isinstance checks
and collection calls in _dash.py, _gradio.py, and _streamlit.py.

Also fixed test assertion in test_df_to_html.py to match actual
implementation output format.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
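
The pattern this commit describes, as a small sketch (the helper name is made up for illustration; the adapters inline the check):

```python
from typing import Union

import polars as pl


def _ensure_collected(frame: Union[pl.DataFrame, pl.LazyFrame]) -> pl.DataFrame:
    """Collect a LazyFrame before DataFrame-only calls like .shape or .to_pandas()."""
    if isinstance(frame, pl.LazyFrame):
        return frame.collect()
    return frame


# e.g. inside an adapter's render path:
# df = _ensure_collected(qc.df())
# n_rows, n_cols = df.shape
```
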
@cpsievert cpsievert marked this pull request as ready for review January 15, 2026 21:05
@cpsievert cpsievert requested a review from gadenbuie January 15, 2026 21:05
@cpsievert cpsievert merged commit f01820f into main Jan 15, 2026
13 checks passed
@cpsievert cpsievert deleted the feat/py-lazy-frames branch January 15, 2026 21:09