Co-authored-by: joocer <1688479+joocer@users.noreply.github.com>
@copilot
Added in commit ca54e77 (sort_by) and the latest commit. The
Example:

    BatchWriter(
        inner_writer=DiskWriter,
        dataset="data",
        format="parquet",
        use_dictionary=["category", "status"]  # Only these columns use dictionary encoding
    )
Pull Request Overview
This PR enhances the parquet writer with two new features: 1) extending the sort_by parameter to accept a list of strings for multi-column ascending sorting, and 2) adding a new use_dictionary parameter to control dictionary encoding for parquet columns.
- Added support for `List[str]` format in `sort_by` parameter for cleaner multi-column sorting syntax
- Introduced `use_dictionary` parameter with support for boolean values and column-specific lists
- Updated type hints and conversion logic to handle the new parameter formats
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| tests/test_writer_parquet_features.py | Added comprehensive test coverage for both new features including multi-column sorting and dictionary encoding scenarios |
| mabel/data/writers/internals/blob_writer.py | Updated constructor parameters, type hints, and parquet writing logic to support the new features |
    schema: Optional[RelationSchema] = None,
    parquet_row_group_size: int = 5000,
    sort_by: Optional[str] = None,
    sort_by: Optional[Union[str, List]] = None,
The type hint List is too generic. It should be List[Union[str, Tuple[str, str]]] to clearly indicate it accepts either a list of column names or a list of tuples with column name and sort direction.
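A minimal sketch of the narrower hint suggested above. The alias names `SortKey` and `SortBy` and the function `describe_sort_by` are hypothetical, introduced only for illustration; they are not part of the PR.

```python
from typing import List, Optional, Tuple, Union

# Hypothetical aliases, for illustration only; the names are not from the PR.
SortKey = Union[str, Tuple[str, str]]         # "col" or ("col", "ascending")
SortBy = Optional[Union[str, List[SortKey]]]  # None, one column, or a list of keys


def describe_sort_by(sort_by: SortBy = None) -> str:
    """Return a short label for which accepted form was passed."""
    if sort_by is None:
        return "unsorted"
    if isinstance(sort_by, str):
        return "single column"
    return "multiple columns"
```

With an alias like this, the constructor signature could read `sort_by: SortBy = None`, which documents both list forms in one place.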
    pytable = pytable.sort_by(self.sort_by)
    # Convert list of strings to PyArrow format
    sort_spec = self.sort_by
    if isinstance(self.sort_by, list) and all(isinstance(item, str) for item in self.sort_by):
This validation logic should be extracted to a separate method or moved to the constructor for earlier validation and better error handling. Currently, invalid sort_by values would only be caught during commit().
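One way the suggestion could look, as a sketch only. `validate_sort_by` and `ExampleWriter` are hypothetical names standing in for the real writer class, and the rule that the order must be `"ascending"` or `"descending"` is an assumption based on the values PyArrow's `Table.sort_by` accepts.

```python
# Sort orders assumed valid here, matching what PyArrow's Table.sort_by accepts.
VALID_ORDERS = ("ascending", "descending")


def validate_sort_by(sort_by) -> None:
    """Raise early if sort_by is not None, a str, or a list of str / (str, str)."""
    if sort_by is None or isinstance(sort_by, str):
        return
    if not isinstance(sort_by, list):
        raise TypeError(
            f"sort_by must be None, str or list, got {type(sort_by).__name__}"
        )
    for item in sort_by:
        if isinstance(item, str):
            continue
        if (
            isinstance(item, tuple)
            and len(item) == 2
            and isinstance(item[0], str)
            and item[1] in VALID_ORDERS
        ):
            continue
        raise ValueError(f"invalid sort_by entry: {item!r}")


class ExampleWriter:
    """Hypothetical stand-in for the writer; only shows where validation would run."""

    def __init__(self, sort_by=None):
        validate_sort_by(sort_by)  # fails here rather than later in commit()
        self.sort_by = sort_by
```

Validating in the constructor means a bad `sort_by` surfaces before any rows are buffered, instead of when the batch is committed.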
Overview
Enhanced the parquet writer with two new features:
- `sort_by` parameter now accepts a list of strings for multi-column sorting, making it easier to sort data by multiple columns in ascending order.
- `use_dictionary` parameter to control dictionary encoding for parquet columns.

What Changed
Sort By Parameter
Previously, the `sort_by` parameter only supported:
- `None` - no sorting (insertion order)
- `str` - sort by a single column
- `List[Tuple[str, str]]` - sort by multiple columns with explicit direction

Now it also supports:
- `List[str]` - sort by multiple columns (all ascending)

Dictionary Encoding Parameter (NEW)
Added `use_dictionary` parameter that supports:
- `None` (default) - PyArrow's default behavior (True)
- `True` - enable dictionary encoding for all columns
- `False` - disable dictionary encoding for all columns
- `List[str]` - enable dictionary encoding only for specified columns

Example Usage
Sort By
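A minimal sketch of how a list of column names maps onto the sort specification PyArrow expects. The helper name `to_sort_spec` is hypothetical; the conversion mirrors the check shown in the reviewed diff (a list made up entirely of strings becomes `(column, "ascending")` pairs, anything else is passed through unchanged).

```python
def to_sort_spec(sort_by):
    """Convert a list of column names to [(col, "ascending"), ...]; pass other forms through."""
    if isinstance(sort_by, list) and all(isinstance(col, str) for col in sort_by):
        return [(col, "ascending") for col in sort_by]
    return sort_by


# List of names: every column sorts ascending.
print(to_sort_spec(["category", "id"]))
# [('category', 'ascending'), ('id', 'ascending')]

# Explicit directions and single column names are left as they are.
print(to_sort_spec([("category", "descending"), ("id", "ascending")]))
# [('category', 'descending'), ('id', 'ascending')]
print(to_sort_spec("id"))
# id
```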
Dictionary Encoding
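A sketch of how the parameter could be forwarded only when it is set, so that `None` keeps PyArrow's default. The helper `parquet_write_options` is hypothetical; the resulting dictionary is meant to be unpacked into a call such as `pyarrow.parquet.write_table(table, path, **options)`.

```python
def parquet_write_options(use_dictionary=None) -> dict:
    """Build keyword arguments for the parquet write call.

    use_dictionary is only included when the caller supplied a value,
    so None leaves PyArrow's own default in effect.
    """
    options = {}
    if use_dictionary is not None:
        options["use_dictionary"] = use_dictionary
    return options


print(parquet_write_options())                        # {}
print(parquet_write_options(False))                   # {'use_dictionary': False}
print(parquet_write_options(["category", "status"]))  # {'use_dictionary': ['category', 'status']}
```

Restricting dictionary encoding to low-cardinality columns such as `category` or `status` is the case the list form is meant for.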
Implementation Details
- `sort_by` type hint changed from `Optional[str]` to `Optional[Union[str, List]]`
- `List[str]` is converted to PyArrow's expected format `[(col, "ascending")]`
- `use_dictionary` parameter added with type `Optional[Union[bool, List[str]]]`
- `write_table` call passes `use_dictionary` when specified

Testing
Added comprehensive tests covering:
- `sort_by=["id"]`
- `sort_by=["category", "id"]`
- `use_dictionary=True`
- `use_dictionary=False`
- `use_dictionary=["category", "status"]`

All existing tests continue to pass.
Related
Fixes the feature request for accepting a list of strings in the `sort_by` parameter for easier multi-column sorting, and adds dictionary encoding control for optimizing parquet file size and performance.