KubernetesPodOperator: Fix hanging API calls #60254
Conversation
Overall solid PR. The wrapper class is a great approach, but there is one potential interaction I wanted to flag after reading through the code path. In idle or stalled watch scenarios (i.e. no events received for a period), the client-side API timeout effectively acts as a hard upper bound on how long the watch remains open whenever it is lower than the server-side watch timeout. Normally this would not be too concerning, but it could cut long-running watches short.
jscheffl
left a comment
Some typing hint nits, else LGTM!
It might also be good to consider the comments from @SameerMesiah97, mainly documenting this (maybe inline in the code?). I know that watches are limited to 30s, which was the reason for the timeout; adding the 60s hard limit is, I assume, still beneficial as a fail-safe.
@jscheffl Thank you for your input. This is a good suggestion, and I've implemented the changes accordingly. @SameerMesiah97 Thank you for the feedback. I've implemented a dynamic approach where the client-side timeout now depends on the server-side timeout. If the server-side timeout exceeds the default API_TIMEOUT, the client-side timeout is set to the server-side timeout plus an offset. This ensures the server has adequate time to respond before the client times out. I believe this is a more robust solution than simply documenting the behavior.
Great. That is even better, as it makes the timeout behave more in line with user expectations. I do not want to sound pedantic, but I think this dynamic approach to handling conflicting timeouts should be mentioned in the docstring. Though the 5 second default is quite short, users scouring the logs might notice the discrepancy and think it is unintentional.
jscheffl
left a comment
I like it as it is already, but adding a small note does not hurt. @AutomationDev85, can you apply this nit? Then I can merge it into the next providers wave by tomorrow.
@SameerMesiah97 I've adopted your text. Thanks for the great improvement!
* Kubernetes API calls use timeouts
* Add wrapper API clients to add a default timeout
* Fix mypy and docs issues
* Fix typing and rework server-side and client timeouts
* Extend function note
---------
Co-authored-by: AutomationDev85 <AutomationDev85>
Overview
We observed intermittent hangs in Kubernetes API communication from KubernetesPodOperator, with calls stalling for over a day until tasks were manually stopped. Investigation showed the Kubernetes Python client wasn’t using a client-side timeout, so stalled connections could block indefinitely. To improve robustness, we add a client-side timeout to API calls so they raise a clear exception instead of leaving tasks hanging. This does not fix the underlying cluster/API issue, but it makes failures detectable and recoverable.
We chose a 60-second timeout: long enough to tolerate load, short enough to prevent indefinite hangs.
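The wrapper-client idea mentioned in the review can be sketched roughly as follows. This is a minimal illustration, not the PR's actual implementation: the class name is hypothetical, and the only assumption taken from the Kubernetes Python client is that its API methods accept a `_request_timeout` keyword argument.

```python
# Hypothetical sketch: delegate attribute access to a real API client and
# inject a default ``_request_timeout`` when the caller did not pass one.
import functools

API_TIMEOUT = 60  # seconds; the default chosen in this PR


class TimeoutEnforcingClient:
    """Wrap an API client so every call carries a client-side timeout."""

    def __init__(self, api_client, timeout: int = API_TIMEOUT):
        self._api_client = api_client
        self._timeout = timeout

    def __getattr__(self, name):
        attr = getattr(self._api_client, name)
        if not callable(attr):
            return attr

        @functools.wraps(attr)
        def wrapper(*args, **kwargs):
            # Only apply the default; an explicit caller-supplied timeout wins.
            kwargs.setdefault("_request_timeout", self._timeout)
            return attr(*args, **kwargs)

        return wrapper
```

The design keeps every existing call site unchanged: code that already passes `_request_timeout` keeps its value, while calls that previously had no timeout now fail fast instead of hanging indefinitely.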
Change Summary