Description
When using an operator derived from BaseSQLToGCSOperator with output_format=parquet, the default parquet_row_group_size is 1, meaning every row is written as its own row group. This seems like a very strange default, and in my experience it leads to some very unwanted results: enormous Parquet files, workers running out of memory, and long task durations.
I know this parameter is configurable, but my point is that this default setting should be changed to something more usable out of the box.
Use case/motivation
I looked up the default settings of some other Parquet writers. Spark seems to default to 128 MB row groups. DuckDB defaults to 122,880 rows per row group according to the docs, and Polars defaults to 512^2 (262,144) rows.
Considering this, and the unwanted effects I noticed with 1 row per row group, I think the default should be changed. However, I'm not sure what a good default would be for this Airflow operator.
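To put the defaults quoted above side by side, here is a rough sketch of how many row groups each one would produce for the same export. The 1,000,000 row count and the row_group_count helper are hypothetical, chosen only for illustration. Spark is left out because its default is expressed in bytes rather than rows.

```python
# Row-based defaults quoted in this issue, in rows per row group.
ROWS_PER_GROUP = {
    "BaseSQLToGCSOperator (current default)": 1,
    "DuckDB": 122_880,
    "Polars": 512 ** 2,
}

# Hypothetical export size, chosen only for illustration.
TOTAL_ROWS = 1_000_000


def row_group_count(total_rows: int, rows_per_group: int) -> int:
    """Return how many row groups are needed to hold total_rows rows."""
    # Integer ceiling division: a partially filled last group still counts.
    return (total_rows + rows_per_group - 1) // rows_per_group


for writer, rows_per_group in ROWS_PER_GROUP.items():
    groups = row_group_count(TOTAL_ROWS, rows_per_group)
    print(f"{writer}: {groups:,} row groups")
```

With a row group size of 1, the number of row groups equals the number of rows, and every row group carries its own metadata in the file footer, which would be consistent with the file sizes and memory usage I saw.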
Related issues
No response
Are you willing to submit a PR?
Code of Conduct