Add retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator #39201
Why don't you use Airflow's built-in retry parameter?
I use the same approach that we use to retry Pod creation: https://github.com/apache/airflow/blob/main/airflow/providers/cncf/kubernetes/utils/pod_manager.py#L347C1-L356C1
I was thinking about the BaseOperator argument:
    PythonOperator(
        task_id="aa",
        retries=3,
        python_callable=toto,
    )
Related: #15137
This is an option for users. If a user wants to retry a specific task, they can use this parameter. Here, if I understand correctly, @MaksYermak wants to retry without the user being aware or needing to do something.
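To illustrate the distinction, here is a minimal sketch of a retry that lives inside the creation call, so the caller configures nothing. It uses only the standard library and is not the code from this PR; `retry_internally` and `FlakyCreator` are hypothetical names invented for the example.

```python
import time
from functools import wraps


def retry_internally(attempts=3, base_delay=0.0):
    """Retry the wrapped call up to `attempts` times, then re-raise."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception:
                    if attempt == attempts:
                        raise  # out of attempts: surface the last error
                    # Exponential backoff between attempts.
                    time.sleep(base_delay * 2 ** (attempt - 1))
        return wrapper
    return decorator


class FlakyCreator:
    """Hypothetical stand-in for an operator's resource-creation step."""

    def __init__(self, failures_before_success):
        self.remaining_failures = failures_before_success
        self.calls = 0

    @retry_internally(attempts=3)
    def create(self):
        self.calls += 1
        if self.remaining_failures > 0:
            self.remaining_failures -= 1
            raise RuntimeError("transient failure")
        return "created"
```

A caller simply invokes `FlakyCreator(2).create()` and gets `"created"` back; the two failed attempts never reach the task level, so the task's own `retries` budget is left untouched.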
vincbeck
left a comment
Could you add tests to cover these retries?
f265b2a to 798a60d
Sure, I have added unit tests.
798a60d to 1941e94
1941e94 to 8e07ee0
Hi @raphaelauv @dirrao @vincbeck!
In this PR I have added retry logic for KubernetesCreateResourceOperator and KubernetesJobOperator.
This logic is needed to prevent the 'No agent available' error. The error appears from time to time when users try to create a Resource or Job. The issue is inside Kubernetes and currently has no solution. As a temporary workaround, we decided to retry the Job or Resource creation request each time this error appears.
The same issue reported for the cert-manager service: cert-manager/cert-manager#6457
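As a rough sketch of retrying only on this specific error, the snippet below retries a creation callable when the exception text contains 'No agent available' and re-raises anything else immediately. It is standard-library only and not the PR's implementation; `ApiError`, `should_retry_creation`, and `create_with_retry` are hypothetical names, and matching on the message text is an assumption about how the error can be recognized.

```python
import time

AGENT_ERROR = "No agent available"


class ApiError(Exception):
    """Hypothetical stand-in for the Kubernetes client's API exception."""


def should_retry_creation(exc):
    """Retry only when the error text mentions the transient agent error."""
    return isinstance(exc, ApiError) and AGENT_ERROR in str(exc)


def create_with_retry(create, attempts=3, delay=0.0):
    """Call `create()`; retry on the agent error, re-raise anything else at once."""
    for attempt in range(1, attempts + 1):
        try:
            return create()
        except Exception as exc:
            # Give up when attempts are exhausted or the error is not the transient one.
            if attempt == attempts or not should_retry_creation(exc):
                raise
            time.sleep(delay)
```

Restricting the retry to one known transient error keeps genuine failures, such as a permission error or an invalid manifest, from being masked or delayed by pointless retries.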