Skip to content

Add support for list of strings in sort_by and dictionary encoding in parquet writer#421

Merged
joocer merged 5 commits intomainfrom
copilot/fix-21057b26-b99a-47df-8604-20910c37976a
Oct 8, 2025
Merged

Add support for list of strings in sort_by and dictionary encoding in parquet writer#421
joocer merged 5 commits intomainfrom
copilot/fix-21057b26-b99a-47df-8604-20910c37976a

Conversation

Copy link
Contributor

Copilot AI commented Oct 6, 2025

Overview

Enhanced the parquet writer with two new features:

  1. The sort_by parameter now accepts a list of strings for multi-column sorting, making it easier to sort data by multiple columns in ascending order.
  2. Added use_dictionary parameter to control dictionary encoding for parquet columns.

What Changed

Sort By Parameter

Previously, the sort_by parameter only supported:

  • None - no sorting (insertion order)
  • str - sort by a single column
  • List[Tuple[str, str]] - sort by multiple columns with explicit direction

Now it also supports:

  • List[str] - sort by multiple columns (all ascending)

Dictionary Encoding Parameter (NEW)

Added use_dictionary parameter that supports:

  • None (default) - PyArrow's default behavior (True)
  • True - enable dictionary encoding for all columns
  • False - disable dictionary encoding for all columns
  • List[str] - enable dictionary encoding only for specified columns

Example Usage

Sort By

from mabel.data import BatchWriter
from mabel.adapters.disk import DiskWriter

# Before: Required verbose tuple syntax for multi-column sorting
writer = BatchWriter(
    inner_writer=DiskWriter,
    dataset="my_dataset",
    format="parquet",
    sort_by=[("category", "ascending"), ("price", "ascending")]  # Verbose
)

# Now: Simple list of strings for ascending multi-column sort
writer = BatchWriter(
    inner_writer=DiskWriter,
    dataset="my_dataset",
    format="parquet",
    sort_by=["category", "price"]  # Much cleaner!
)

Dictionary Encoding

# Enable dictionary encoding for specific columns (efficient for low-cardinality columns)
writer = BatchWriter(
    inner_writer=DiskWriter,
    dataset="my_dataset",
    format="parquet",
    use_dictionary=["category", "status"]  # Only these columns use dictionary encoding
)

Implementation Details

  • Updated type hint for sort_by from Optional[str] to Optional[Union[str, List]]
  • Added conversion logic to transform List[str] to PyArrow's expected format [(col, "ascending")]
  • Added use_dictionary parameter with type Optional[Union[bool, List[str]]]
  • Updated parquet write_table call to pass use_dictionary when specified
  • Maintains full backward compatibility with all existing formats
  • No breaking changes

Testing

Added comprehensive tests covering:

  • Single column sorting with list syntax: sort_by=["id"]
  • Multi-column sorting with list syntax: sort_by=["category", "id"]
  • Dictionary encoding enabled for all columns: use_dictionary=True
  • Dictionary encoding disabled for all columns: use_dictionary=False
  • Dictionary encoding for specific columns: use_dictionary=["category", "status"]
  • Verification that existing functionality (None, string, list of tuples) still works
  • Both features working together

All existing tests continue to pass.

Related

Fixes the feature request for accepting list of strings in the sort_by parameter for easier multi-column sorting and adds dictionary encoding control for optimizing parquet file size and performance.

Original prompt

The parquet writer allows us to specify a sort column, can we update this to accept:

None (default) - write as exists in the WAL
string - order by the given column
list of strings - order by the first column, then the second column etc etc


💡 You can make Copilot smarter by setting up custom instructions, customizing its development environment and configuring Model Context Protocol (MCP) servers. Learn more Copilot coding agent tips in the docs.

Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>
Copilot AI changed the title [WIP] The parquet writer allows us to specify a sort column, can we update this to accept: None (default) - write as exists in the WAL string - order by the given column list of strings - order by the first column, then the second column etc etc Add support for list of strings in parquet writer sort_by parameter Oct 6, 2025
Copilot AI requested a review from joocer October 6, 2025 13:40
@joocer
Copy link
Member

joocer commented Oct 6, 2025

@copilot
Can we also add the ability to define which columns should be dictionary-encoded?

Copy link
Contributor Author

Copilot AI commented Oct 6, 2025

@copilot
Can we also add the ability to define which columns should be dictionary-encoded?

Added in commit ca54e77 (sort_by) and the latest commit. The use_dictionary parameter now supports:

  • True - enable for all columns
  • False - disable for all columns
  • ["col1", "col2"] - enable only for specified columns

Example:

BatchWriter(
    inner_writer=DiskWriter,
    dataset="data",
    format="parquet",
    use_dictionary=["category", "status"]  # Only these columns use dictionary encoding
)

Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>
Copilot AI changed the title Add support for list of strings in parquet writer sort_by parameter Add support for list of strings in sort_by and dictionary encoding in parquet writer Oct 6, 2025
@joocer joocer marked this pull request as ready for review October 8, 2025 19:57
Copilot AI review requested due to automatic review settings October 8, 2025 19:57
Copy link

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR enhances the parquet writer with two new features: 1) extending the sort_by parameter to accept a list of strings for multi-column ascending sorting, and 2) adding a new use_dictionary parameter to control dictionary encoding for parquet columns.

  • Added support for List[str] format in sort_by parameter for cleaner multi-column sorting syntax
  • Introduced use_dictionary parameter with support for boolean values and column-specific lists
  • Updated type hints and conversion logic to handle the new parameter formats

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.

File Description
tests/test_writer_parquet_features.py Added comprehensive test coverage for both new features including multi-column sorting and dictionary encoding scenarios
mabel/data/writers/internals/blob_writer.py Updated constructor parameters, type hints, and parquet writing logic to support the new features

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

schema: Optional[RelationSchema] = None,
parquet_row_group_size: int = 5000,
sort_by: Optional[str] = None,
sort_by: Optional[Union[str, List]] = None,
Copy link

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The type hint List is too generic. It should be List[Union[str, Tuple[str, str]]] to clearly indicate it accepts either a list of column names or a list of tuples with column name and sort direction.

Copilot uses AI. Check for mistakes.
pytable = pytable.sort_by(self.sort_by)
# Convert list of strings to PyArrow format
sort_spec = self.sort_by
if isinstance(self.sort_by, list) and all(isinstance(item, str) for item in self.sort_by):
Copy link

Copilot AI Oct 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This validation logic should be extracted to a separate method or moved to the constructor for earlier validation and better error handling. Currently, invalid sort_by values would only be caught during commit().

Copilot uses AI. Check for mistakes.
@joocer joocer merged commit 889550b into main Oct 8, 2025
3 of 12 checks passed
@joocer joocer deleted the copilot/fix-21057b26-b99a-47df-8604-20910c37976a branch October 8, 2025 20:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants