
Fix duplicate pod creation in KubernetesJobOperator#53368

Merged
shahar1 merged 1 commit into apache:main from stephen-bracken:stephen-bracken/cncf-kubernetes-job-fix
Jan 10, 2026

Conversation

@stephen-bracken
Contributor

@stephen-bracken stephen-bracken commented Jul 15, 2025

Fix KubernetesJobOperator.get_or_create_pod() sometimes creating duplicate pods.

(re-raised from #52885)

During execute(), the KubernetesJobOperator looks up the pod associated with the Job object by calling self.get_or_create_pod(). If Kubernetes is slow, the Job may not have created its pod by the time this method is called. In that case the underlying find_pod() method returns None, and a duplicate headless pod is created for the task.

This PR removes references to the get_or_create_pod() method in favour of KubernetesJobOperator.get_pod() to prevent creating headless pods.
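
The race described above can be illustrated with a small simulation. This is only a sketch under stated assumptions: FakeCluster, job_controller_tick, and the two *_sketch functions are hypothetical stand-ins written for this example. They are not the provider's implementation and they do not call the Kubernetes API.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class FakeCluster:
    """Hypothetical in-memory stand-in for a cluster (not a real client)."""

    pods: list[str] = field(default_factory=list)

    def find_pod(self, job_name: str) -> str | None:
        # Return the pod owned by the Job, or None if it does not exist yet.
        owned = [name for name in self.pods if name.startswith(f"{job_name}-")]
        return owned[0] if owned else None

    def create_pod(self, name: str) -> str:
        self.pods.append(name)
        return name

    def job_controller_tick(self, job_name: str) -> None:
        # Simulates the Job controller eventually creating the Job's own pod.
        if self.find_pod(job_name) is None:
            self.create_pod(f"{job_name}-abcde")


def get_or_create_pod_sketch(cluster: FakeCluster, job_name: str) -> str:
    """Racy pattern: falls back to creating a pod when none is found."""
    pod = cluster.find_pod(job_name)
    if pod is None:
        # The Job has not produced its pod yet, so a second, unowned pod appears.
        pod = cluster.create_pod(f"headless-{job_name}")
    return pod


def get_pod_sketch(cluster: FakeCluster, job_name: str, max_ticks: int = 3) -> str:
    """Safer pattern: only look up the pod, waiting for the Job to create it."""
    for _ in range(max_ticks):
        pod = cluster.find_pod(job_name)
        if pod is not None:
            return pod
        cluster.job_controller_tick(job_name)  # stands in for waiting and retrying
    raise RuntimeError(f"No pod found for job {job_name!r}")
```

In this simulation, calling get_or_create_pod_sketch() before the controller tick leaves two pods in the cluster once the Job catches up, while get_pod_sketch() only ever sees the single pod owned by the Job.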



@boring-cyborg boring-cyborg bot added area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues labels Jul 15, 2025
@stephen-bracken stephen-bracken force-pushed the stephen-bracken/cncf-kubernetes-job-fix branch 2 times, most recently from 0e04769 to a9ad704 Compare July 15, 2025 15:00
@stephen-bracken stephen-bracken marked this pull request as ready for review July 15, 2025 15:00
@stephen-bracken stephen-bracken force-pushed the stephen-bracken/cncf-kubernetes-job-fix branch 4 times, most recently from d6d3fa4 to 8b43bca Compare July 17, 2025 14:15
Contributor

@shahar1 shahar1 left a comment


Overall LGTM - got some minor comments.
Also, I'll be happy for an additional review :)

@stephen-bracken stephen-bracken force-pushed the stephen-bracken/cncf-kubernetes-job-fix branch 13 times, most recently from 30445de to f123b1a Compare July 23, 2025 20:06
@stephen-bracken
Contributor Author

In reference to #49899 - I think a change is still necessary to avoid creating multiple pods when parallelism is not set. Do we want to change the control flow to use get_pods() in all circumstances?

@shahar1
Contributor

shahar1 commented Jul 26, 2025

In reference to #49899 - I think a change is still necessary to avoid creating multiple pods when parallelism is not set. Do we want to change the control flow to use get_pods() in all circumstances?

You could start with handling only the case where parallelism=False, and later we could simplify the logic if it becomes necessary. Please rebase/merge changes from the main branch, adjust the logic and fix tests appropriately.

@github-actions

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed in 5 days if no further activity occurs. Thank you for your contributions.

@github-actions github-actions bot added the stale Stale PRs per the .github/workflows/stale.yml policy file label Sep 10, 2025
@github-actions github-actions bot closed this Sep 15, 2025
Contributor

Copilot AI left a comment


Pull request overview

Copilot reviewed 2 out of 2 changed files in this pull request and generated 4 comments.

Comments suppressed due to low confidence (1)

providers/cncf/kubernetes/tests/unit/cncf/kubernetes/operators/test_job.py:921

  • The condition if do_xcom_push and not parallelism is checking if parallelism is falsy (0, None, False, etc.). However, with the change to default parallelism=1, this condition will never be true when parallelism=1 is set. The logic should check if do_xcom_push and parallelism == 1 instead to handle the single pod case correctly.
        if do_xcom_push and not parallelism:
            mock_extract_xcom.assert_called_once()
        elif do_xcom_push and parallelism is not None:
            assert mock_extract_xcom.call_count == parallelism
        else:
            mock_extract_xcom.assert_not_called()
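
To make the truthiness point in the comment above concrete, here is a minimal sketch. The helper expected_xcom_calls is hypothetical and exists only for this example; it mirrors the branching of the quoted snippet and is not part of the test file.

```python
from __future__ import annotations


def expected_xcom_calls(do_xcom_push: bool, parallelism: int | None) -> int:
    """Hypothetical helper mirroring the branches of the quoted test snippet."""
    if do_xcom_push and not parallelism:
        # `not parallelism` is True for None and 0, but False for 1.
        return 1
    if do_xcom_push and parallelism is not None:
        # parallelism=1 lands here, which also expects exactly one call.
        return parallelism
    return 0
```

Under this reading, parallelism=1 skips the first branch but still ends up expecting a single XCom extraction through the second branch.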


Contributor

@shahar1 shahar1 left a comment


Almost there!
I've revised my comment from a previous review (apologies for that!).
Please review my current comments + address Copilot comments.

@stephen-bracken stephen-bracken force-pushed the stephen-bracken/cncf-kubernetes-job-fix branch 3 times, most recently from 2c57cda to d561015 Compare January 2, 2026 10:09
@stephen-bracken stephen-bracken force-pushed the stephen-bracken/cncf-kubernetes-job-fix branch from d561015 to b0f5289 Compare January 2, 2026 15:58
@shahar1 shahar1 changed the title remove references to KubernetesJobOperator.get_or_create_pod() to fix creating duplicate pods Remove references to KubernetesJobOperator.get_or_create_pod() to fix creating duplicate pods Jan 3, 2026
@shahar1
Contributor

shahar1 commented Jan 3, 2026

LGTM. I'm checking with the other contributors that handling the parallelism = 0 case makes sense:
https://apache-airflow.slack.com/archives/C06K9Q5G2UA/p1767429130216899

If no strong objections are given in the next 1-2 days, or there are additional approvals for this PR by then - I'm ok with merging it as-is.

@jscheffl
Contributor

jscheffl commented Jan 3, 2026

@stephen-bracken / @shahar1 following the concerns raised in https://apache-airflow.slack.com/archives/C06K9Q5G2UA/p1767429130216899, one thing to consider: I'd propose adding a newsfragment so that this change is highlighted in the changelog of the provider.

TLDR: Requesting a newsfragment to highlight this interface change.

UPDATE: Args, providers have no newsfragments. Add it to the changelog instead, as in https://github.com/apache/airflow/pull/59143/changes#diff-24cff4e7b7926b95f4efef45da9f9d6f43b237b5143990b1554113251cb2c12eR30 for example, so that it is included in the next providers wave.

@stephen-bracken stephen-bracken force-pushed the stephen-bracken/cncf-kubernetes-job-fix branch 2 times, most recently from 8befa0f to 5a29826 Compare January 5, 2026 10:03
@stephen-bracken
Contributor Author

@stephen-bracken / @shahar1 following the concerns raised in https://apache-airflow.slack.com/archives/C06K9Q5G2UA/p1767429130216899, one thing to consider: I'd propose adding a newsfragment so that this change is highlighted in the changelog of the provider.

TLDR: Requesting a newsfragment to highlight this interface change.

UPDATE: Args, providers have no newsfragments. Add it to the changelog instead, as in https://github.com/apache/airflow/pull/59143/changes#diff-24cff4e7b7926b95f4efef45da9f9d6f43b237b5143990b1554113251cb2c12eR30 for example, so that it is included in the next providers wave.

@jscheffl Thanks, added a changelog note.

@stephen-bracken stephen-bracken force-pushed the stephen-bracken/cncf-kubernetes-job-fix branch from 5a29826 to 50261b1 Compare January 5, 2026 10:36
@stephen-bracken stephen-bracken force-pushed the stephen-bracken/cncf-kubernetes-job-fix branch from 50261b1 to 6b3dc30 Compare January 5, 2026 14:08
@shahar1 shahar1 marked this pull request as draft January 10, 2026 10:32
@shahar1
Contributor

shahar1 commented Jan 10, 2026

Drafting the PR following the author's request to make some changes, please do not merge

Contributor

@shahar1 shahar1 left a comment


See previous comment

@stephen-bracken stephen-bracken force-pushed the stephen-bracken/cncf-kubernetes-job-fix branch from 6b3dc30 to faae3ae Compare January 10, 2026 10:35
@boring-cyborg

boring-cyborg bot commented Jan 10, 2026

Awesome work, congrats on your first merged pull request! You are invited to check our Issue Tracker for additional contributions.

@shahar1
Contributor

shahar1 commented Jan 10, 2026

Great job @stephen-bracken !


Labels

area:providers provider:cncf-kubernetes Kubernetes (k8s) provider related issues


5 participants