KubernetesPodOperator: Fix hanging API calls #60254
Conversation
Overall solid PR. The wrapper class is a great approach, but there is one potential interaction I wanted to flag after reading through the code path. In idle or stalled watch scenarios (i.e. no events received for a period), the client-side API timeout effectively acts as a hard upper bound on how long the watch remains open whenever it is lower than the server-side watch timeout. Normally this would not be too concerning, but it could cut long-running watches short.
jscheffl
left a comment
Some typing hint nits, else LGTM!
It might also be good to consider the comments from @SameerMesiah97, mainly documenting this (maybe inline in the code?). I know that watches are limited to 30s, which was the reason for the timeout; adding the 60s hard limit is, I assume, still beneficial as a fail-safe.
@jscheffl Thank you for your input. This is a good suggestion, and I've implemented the changes accordingly. @SameerMesiah97 Thank you for the feedback. I've implemented a dynamic approach where the client-side timeout now depends on the server-side timeout. If the server-side timeout exceeds the default API_TIMEOUT, the client-side timeout is set to the server-side timeout plus an offset. This ensures the server has adequate time to respond before the client times out. I believe this is a more robust solution than simply documenting the behavior.
Great. That is even better, as it makes the timeout behave more in line with user expectations. I do not want to sound pedantic, but I think this dynamic approach to handling conflicting timeouts should be mentioned in the docstring. Though the 5 second default is quite short, users scouring the logs might notice the discrepancy and think it is unintentional.
jscheffl
left a comment
I like it as it is already, but adding a small note does not hurt. @AutomationDev85, can you apply this nit? Then I can merge it into the next providers wave by tomorrow.
@SameerMesiah97 I've adopted your text. Thanks for the great improvement!
* Kubernetes API calls use timeouts
* Add wrapper API clients to add a default timeout
* Fix mypy and docs issues
* Fix typing and rework server-side and client timeouts
* Extend function note
---------
Co-authored-by: AutomationDev85 <AutomationDev85>
Overview
We observed intermittent hangs in Kubernetes API communication from KubernetesPodOperator, with calls stalling for over a day until tasks were manually stopped. Investigation showed the Kubernetes Python client wasn’t using a client-side timeout, so stalled connections could block indefinitely. To improve robustness, we add a client-side timeout to API calls so they raise a clear exception instead of leaving tasks hanging. This does not fix the underlying cluster/API issue, but it makes failures detectable and recoverable.
We chose a 60-second timeout: long enough to tolerate load, short enough to prevent indefinite hangs.
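The wrapper-client idea mentioned in the review can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the class name is hypothetical, and the only assumption taken from the Kubernetes Python client is that its API methods accept a `_request_timeout` keyword argument.

```python
# Hypothetical sketch: delegate attribute access to a real API client and
# inject a default ``_request_timeout`` when the caller did not pass one.
import functools

API_TIMEOUT = 60  # seconds; the default chosen in this PR


class TimeoutEnforcingClient:
    """Wrap an API client so every call carries a client-side timeout."""

    def __init__(self, api_client, timeout: int = API_TIMEOUT):
        self._api_client = api_client
        self._timeout = timeout

    def __getattr__(self, name):
        attr = getattr(self._api_client, name)
        if not callable(attr):
            return attr

        @functools.wraps(attr)
        def wrapper(*args, **kwargs):
            # Only apply the default; an explicit caller-supplied timeout wins.
            kwargs.setdefault("_request_timeout", self._timeout)
            return attr(*args, **kwargs)

        return wrapper
```

The design keeps every existing call site unchanged: code that already passes `_request_timeout` keeps its value, while calls that previously had no timeout now fail fast instead of hanging indefinitely.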
Change Summary