Description
Add a flatten_structure parameter to GCSToS3Operator that removes directory structure from transferred files, uploading only the filename to the S3 destination path.
Use case/motivation
Current Behavior:
The GCSToS3Operator always preserves the full GCS object path (including the prefix) when uploading to S3, regardless of the keep_directory_structure setting.
For example:
GCSToS3Operator(
    gcs_bucket="my-bucket",
    prefix="data/2025/01/15/file.parquet",
    dest_s3_key="s3://target-bucket/processed/2025/01/15/",
)
# GCS files: "data/2025/01/15/file.parquet"
# Results in: s3://target-bucket/processed/2025/01/15/data/2025/01/15/file.parquet
# Unwanted path duplication: the GCS prefix "data/2025/01/15/" is appended to the destination path.
This leads to unwanted path duplication whenever users want a different directory layout in S3 than in GCS.
As a result, the file structure cannot be reorganized during transfer without intermediate buckets or complex workarounds.
Desired Behavior:
With flatten_structure=True, only the filename would be appended to the destination key, which also eliminates the path duplication:
GCSToS3Operator(
    gcs_bucket="my-bucket",
    prefix="data/2025/01/15/file.parquet",
    dest_s3_key="s3://target-bucket/processed/2025/01/15/",
    flatten_structure=True,
)
# GCS files: "data/2025/01/15/file.parquet"
# Results in: s3://target-bucket/processed/2025/01/15/file.parquet
# Clean, organized path: only "file.parquet" is appended to the destination path.
Implementation:
def _transform_file_path(self, file_path: str) -> str:
    if self.flatten_structure:
        return os.path.basename(file_path)
    return file_path
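To make the intended effect concrete, here is a minimal standalone sketch of the same transformation outside the operator. The function names `transform_file_path` and `build_dest_key` are hypothetical, and the key-joining logic is a simplified model of the behavior described above, not the operator's actual code. It uses `posixpath` rather than `os.path` on the assumption that GCS object names always use "/" as the separator, so the result does not depend on the host platform.

```python
import posixpath


def transform_file_path(file_path: str, flatten_structure: bool = False) -> str:
    """Return the part of a GCS object name that should be appended to the S3 key."""
    if flatten_structure:
        # Keep only the final path component, e.g. "a/b/file.parquet" -> "file.parquet".
        return posixpath.basename(file_path)
    return file_path


def build_dest_key(dest_prefix: str, file_path: str, flatten_structure: bool = False) -> str:
    """Join a destination prefix with the (optionally flattened) object name.

    Hypothetical helper: a simplified model of how the destination key is formed.
    """
    return dest_prefix.rstrip("/") + "/" + transform_file_path(file_path, flatten_structure)


dest = "s3://target-bucket/processed/2025/01/15/"
print(build_dest_key(dest, "data/2025/01/15/file.parquet"))
print(build_dest_key(dest, "data/2025/01/15/file.parquet", flatten_structure=True))
```

One design point worth deciding in the PR: flattening maps objects that share a filename under different prefixes (for example "a/x.csv" and "b/x.csv") to the same S3 key, so a later upload could overwrite an earlier one unless this is detected or documented.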
This feature enables:
- Flexible path reorganization during cross-cloud transfers
- Cleaner S3 directory structures without GCS-specific paths
- Simplified integration with legacy systems expecting flat structures
- Elimination of post-processing scripts
- Reduced storage complexity and improved performance in S3 LIST operations
Related issues
No response
Are you willing to submit a PR?
Code of Conduct