Use reproducible builds for provider packages #35685
Closed
potiuk wants to merge 2 commits into apache:main from
0fa8df9 to
e104ff5
Compare
potiuk
commented
Nov 16, 2023
Member
Author
This was not used so I removed it.
40b1938 to
c3ec014
Compare
This is a follow-up to apache#35586 and depends on it. It moves the whole functionality of preparing provider packages to breeze, removing the need to do it in the Breeze CI image. Since we have Python breeze with its own environment managed via `pipx`, we can now make sure that all the necessary packages are installed in this environment and run package building in the same environment Breeze uses.

Previously we ran all the package building inside the CI image for two reasons:

* we could rely on the same version of build tools (wheel/setuptools) being installed in the CI image
* security of the provider package preparation, which used the setuptools pre-PEP 517 way of building packages that executed setup.py code

The second reason was about isolating the execution of potentially arbitrary code in setup.py from the host environment in CI, where the host environment might have access to secrets and tokens that would allow PRs coming from forks to break out of the sandbox. The setup.py file was prepared by breeze using Jinja templates, but it was potentially possible to manipulate the provider package directory structure and get Python code injected into the generated setup.py, so it was safer to run it in the isolated Breeze CI environment.

This PR makes it secure to run the build in the host environment: instead of generating setup.cfg and setup.py, we generate a declarative pyproject.toml with all the necessary information and use a PEP 517-compliant way of building provider packages. No arbitrary code can be executed via setup.py on the host this way, so we can safely build provider packages there.

We use flit as the build tool. It is one of the popular build tools, maintained under the Python Packaging Authority (PyPA). It is simple and not too opinionated, and it supports PEP 517 as well as PEP 621, so most of the project metadata can be added to the PEP 621-compliant "project" section of pyproject.toml.

Together with this change we improve the process of generating the extracted sources for the providers. Originally we copied the whole Airflow sources to a single directory (provider_packages) and built provider packages sequentially from that single directory, which made it impossible to parallelise the builds. We now change the approach: instead of copying all airflow sources once to a single directory, we build providers in separate subdirectories of files/provider_packages/PROVIDER_ID and copy there only the relevant sources (i.e. only the provider's subfolder from "airflow/providers"). This is quite a bit faster (each provider is built using only its own sources, so just scanning the directory is faster), and it also allows package preparation to run in parallel because each provider is fully isolated from the others.

This PR also removes the no-longer-needed `prepare_providers_package.py` and the unneeded `provider_packages` folder previously used to prepare providers, as well as the bash script used to build the providers and some unused bash functions.
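The per-provider isolation described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code breeze actually uses: the function names (`prepare_provider_tree`, `prepare_all`), the `max_workers` default, and the use of a thread pool are all hypothetical. It copies only one provider's subfolder from `airflow/providers` into its own directory and does so for several providers concurrently.

```python
from __future__ import annotations

import shutil
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def prepare_provider_tree(airflow_root: Path, out_root: Path, provider_id: str) -> Path:
    """Copy only one provider's sources into its own build directory.

    Hypothetical helper: a provider id such as "cncf.kubernetes" is assumed
    to map to the folder airflow/providers/cncf/kubernetes.
    """
    target_root = out_root / provider_id
    if target_root.exists():
        # Start from a clean directory so stale files cannot leak into a build.
        shutil.rmtree(target_root)
    relative = Path("airflow", "providers", *provider_id.split("."))
    # copytree creates the missing parent directories of the destination.
    shutil.copytree(airflow_root / relative, target_root / relative)
    return target_root


def prepare_all(
    airflow_root: Path,
    out_root: Path,
    provider_ids: list[str],
    max_workers: int = 4,
) -> dict[str, Path]:
    """Prepare every provider in parallel; the directories never overlap."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        results = pool.map(
            lambda provider_id: prepare_provider_tree(airflow_root, out_root, provider_id),
            provider_ids,
        )
        return dict(zip(provider_ids, results))
```

Because each call writes only below its own `out_root / provider_id` directory, the calls do not share any mutable state, which is what makes running them side by side safe. Each prepared directory could then be handed to a PEP 517 build front end separately.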
c3ec014 to
add0f14
Compare
Flit can build reproducible packages (packages that can be compared bit by bit) provided that the source date epoch is set to a repeatable value when the package is built. This PR implements reproducibility of our builds by freezing the documentation preparation time in provider.yaml as the "source date epoch" and always using it when building the package. This way anyone using breeze to build the package will get exactly the same binary package, which will make it much easier for a PMC member to verify whether the packages are ready for release. We will no longer have to check the sources; PMC members will simply need to build the same packages locally using breeze and check that the generated packages are exactly the same. The "source-date-epoch" fields have been regenerated in this PR as well.

This PR also replaces the `lru_cache` method of storing the output of `get_provider_metadata_packages` with a custom-stored dictionary. Thanks to that, instead of invalidating the whole cache of provider metadata read from yaml files, we can refresh individual provider metadata entries after they have been updated. This saves a lot of validation time: every time a provider yaml is updated we need to re-read it and re-validate it against the JSON schema, and with this change we do that only for the updated provider yaml. That saves about half a second per provider yaml update, so updating all providers is much faster.
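The difference between invalidating an `lru_cache` and refreshing a single dictionary entry can be illustrated with a minimal sketch. The class name `ProviderMetadataCache`, its methods, and the `loader` callable are hypothetical and stand in for whatever breeze does to read and validate a provider yaml; they are not the actual implementation.

```python
from __future__ import annotations

from typing import Any, Callable


class ProviderMetadataCache:
    """Dictionary-backed cache whose entries can be refreshed one at a time.

    Hypothetical sketch: `loader` represents the expensive step of reading
    one provider yaml and validating it against the JSON schema.
    """

    def __init__(self, loader: Callable[[str], dict[str, Any]]) -> None:
        self._loader = loader
        self._entries: dict[str, dict[str, Any]] = {}

    def get(self, provider_id: str) -> dict[str, Any]:
        # Load lazily on first access, then serve the stored entry.
        if provider_id not in self._entries:
            self._entries[provider_id] = self._loader(provider_id)
        return self._entries[provider_id]

    def refresh(self, provider_id: str) -> dict[str, Any]:
        # Re-read only this provider; all other entries stay cached.
        self._entries[provider_id] = self._loader(provider_id)
        return self._entries[provider_id]
```

With `functools.lru_cache` the only invalidation available is `cache_clear()`, which drops every stored result, so the next access would re-read and re-validate all provider yaml files. A plain dictionary keeps the cost of an update proportional to the number of providers that actually changed.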
add0f14 to
68100ef
Compare
Member
Author
We need to wait with the PROD build until #35617 gets merged.
Member
Author
Closing in favour of #35693, which runs from the Apache repository so that the PROD image build works.
Based on #35617, so it should only be merged after that one.
(Only the last commit counts.)
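The verification idea behind this PR (build with a fixed source date epoch, then compare the artifacts bit by bit) can be illustrated with the standard library alone. The sketch below is not how flit or breeze build packages; `build_archive` and `sha256_hex` are hypothetical names, and a plain zip archive stands in for a wheel. It shows the ingredients a reproducible archive generally needs: a stable file order, timestamps taken from the frozen epoch instead of the file system, and fixed file attributes.

```python
import hashlib
import io
import time
import zipfile
from pathlib import Path


def build_archive(source_dir: Path, source_date_epoch: int) -> bytes:
    """Build a zip whose bytes depend only on file names, contents and the epoch.

    The zip format cannot store dates before 1980, so the epoch is assumed
    to be a recent Unix timestamp.
    """
    date_time = time.gmtime(source_date_epoch)[:6]
    files = [path for path in source_dir.rglob("*") if path.is_file()]
    # Sort by archive name so directory listing order cannot change the output.
    files.sort(key=lambda path: path.relative_to(source_dir).as_posix())
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for path in files:
            name = path.relative_to(source_dir).as_posix()
            info = zipfile.ZipInfo(name, date_time=date_time)
            info.compress_type = zipfile.ZIP_DEFLATED
            info.create_system = 3  # record "Unix" regardless of the build host
            info.external_attr = 0o644 << 16  # fixed permissions for every file
            archive.writestr(info, path.read_bytes())
    return buffer.getvalue()


def sha256_hex(data: bytes) -> str:
    """Digest used to compare two independently built archives."""
    return hashlib.sha256(data).hexdigest()
```

Two people running `build_archive` on identical sources with the same epoch should get archives with the same digest (given the same Python and zlib versions), even if the files on their disks carry different modification times. Changing the epoch changes the stored timestamps and therefore the digest.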