Skip to content

Reducing memory footprint for synchronous S3KeySensor#55070

Merged
ashb merged 9 commits intoapache:mainfrom
jroachgolf84:issue-55039
Sep 18, 2025
Merged

Reducing memory footprint for synchronous S3KeySensor#55070
ashb merged 9 commits intoapache:mainfrom
jroachgolf84:issue-55039

Conversation

@jroachgolf84
Copy link
Copy Markdown
Collaborator

This PR aims to reduce the memory footprint for the S3KeySensor. This was done by altering the get_file_metadata method in the S3Hook to yield records in the paginated response, rather than loading them into a single list. The return type for the get_file_metadata method is not an Iterator. An assertion was added to validate this, and all appropriate tests were updated.

closes: #55039

@boring-cyborg boring-cyborg bot added area:providers provider:amazon AWS/Amazon - related issues labels Aug 29, 2025
@jroachgolf84 jroachgolf84 requested review from dstandish and removed request for eladkal and o-nikolas August 29, 2025 13:53
@jroachgolf84
Copy link
Copy Markdown
Collaborator Author

cc: @dstandish

@jroachgolf84
Copy link
Copy Markdown
Collaborator Author

@eladkal, since we're changing the return type of get_file_metadata, would this need to be part of a major release?

@dstandish
Copy link
Copy Markdown
Contributor

mentioned it on slack but just adding here

should check with @eladkal re backcompat issues. changing get_file_metadata to iterator likely means major release. and given that maybe we should update the whole check key funciton logic so that it never has to hold everything in memory, i.e. by making it so that check_fn takes a key and not a list

but thinking about that..... the problem is it's somewhat ambiguous what the behavior should be

for best performance, you would want check_fn to return on first "pass"

but, the behavior that would be most similar to current, would be to return all the files for which check_fn evaluates to true

@jroachgolf84
Copy link
Copy Markdown
Collaborator Author

cc: @eladkal

@eladkal
Copy link
Copy Markdown
Contributor

eladkal commented Sep 2, 2025

Lets consult first with @o-nikolas and @vincbeck.

@vincbeck
Copy link
Copy Markdown
Contributor

vincbeck commented Sep 2, 2025

To keep it backward compatible, could we make get_file_metadata return list | Iterator? We would need to handle both cases in providers/amazon/src/airflow/providers/amazon/aws/sensors/s3.py which I do not really like but that's okay. What do you think?

@jroachgolf84
Copy link
Copy Markdown
Collaborator Author

I think that's feasible - how would we determine whether to return a list or an Iterator?

@vincbeck
Copy link
Copy Markdown
Contributor

vincbeck commented Sep 3, 2025

In the current implementation we would always return an iterator but having list | Iterator as return type annotation would keep it backward compatible

@jroachgolf84
Copy link
Copy Markdown
Collaborator Author

In the current implementation we would always return an iterator but having list | Iterator as return type annotation would keep it backward compatible

Making the type annotation change now. Both cases are handled in providers/amazon/src/airflow/providers/amazon/aws/sensors/s3.py.

Copy link
Copy Markdown
Contributor

@vincbeck vincbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually sorry but I realized I provided wrong instructions. This does not solve back compat issues ... What we should do instead is deprecate get_file_metadata and create a new one using Iterator and use this one in the sensor. That way we do not need to create a major release because of this.

@jroachgolf84
Copy link
Copy Markdown
Collaborator Author

jroachgolf84 commented Sep 4, 2025

Got it - so the plan would be to:

  1. Deprecate get_file_metadata with a deprecation warning?
  2. Create a new get_file_metadata method that returns an Iterator, and use that in the Sensor.

What is the best way to deprecate a method and create one with the same name? Or should I create a new method with a new name, and use this one in the Sensor?

@vincbeck
Copy link
Copy Markdown
Contributor

vincbeck commented Sep 4, 2025

Got it - so the plan would be to:

Deprecate get_file_metadata with a deprecation warning?
Create a new get_file_metadata method that returns an Iterator, and use that in the Sensor.

Correct

What is the best way to deprecate a method and create one with the same name?

To deprecate a method, please emit deprecation warning at the beginning of the method. Example below:

warnings.warn(
            "The property `async_conn` is deprecated. Accessing it in an async context will cause the event loop to block. "
            "Use the async method `get_async_conn` instead.",
            AirflowProviderDeprecationWarning,
            stacklevel=2,
        )

Or should I create a new method with a new name, and use this one in the Sensor?

Yes, you should create a new one with a new name

@jroachgolf84
Copy link
Copy Markdown
Collaborator Author

@vincbeck, I've updated the PR accordingly!

Copy link
Copy Markdown
Contributor

@vincbeck vincbeck left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice! Thanks for the quick turn around!

@dstandish
Copy link
Copy Markdown
Contributor

Actually sorry but I realized I provided wrong instructions. This does not solve back compat issues ... What we should do instead is deprecate get_file_metadata and create a new one using Iterator and use this one in the sensor. That way we do not need to create a major release because of this.

I guess just wondering what is wrong with making a major release? How do we determine when to proceed with one?

@vincbeck
Copy link
Copy Markdown
Contributor

vincbeck commented Sep 4, 2025

Actually sorry but I realized I provided wrong instructions. This does not solve back compat issues ... What we should do instead is deprecate get_file_metadata and create a new one using Iterator and use this one in the sensor. That way we do not need to create a major release because of this.

I guess just wondering what is wrong with making a major release? How do we determine when to proceed with one?

Nothing wrong about it, we are just trying to minimize them. Upgrading major version can be painful for users and require some work so we are trying to do that not so often.

@jroachgolf84
Copy link
Copy Markdown
Collaborator Author

@dstandish, can you take a look at the updated PR?

@jroachgolf84
Copy link
Copy Markdown
Collaborator Author

@eladkal, can you take a look at this for me?

@jroachgolf84
Copy link
Copy Markdown
Collaborator Author

@ashb should be all set here!

@ashb ashb merged commit 6a10bc6 into apache:main Sep 18, 2025
78 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

area:providers provider:amazon AWS/Amazon - related issues

Projects

None yet

Development

Successfully merging this pull request may close these issues.

s3 key sensor (sync version) has bad memory footprint

6 participants