-
Notifications
You must be signed in to change notification settings - Fork 16.8k
Tasks Stuck at Scheduled State #40106
Description
Apache Airflow version
2.9.1
If "Other Airflow 2 version" selected, which one?
No response
What happened?
After upgrading to 2.9.1, we find tasks are stuck at scheduled state after about 1 hour scheduler started. During the first hour, all tasks are running fine. Then I restarted the scheduler, and it successfully moved the "stuck" task instances to queued state and then run them. But new tasks got stuck again after one hour.
This is reproduceable in my production cluster. It happens every time after we restart our scheduler. But we are not able to replicate this in our dev cluster.
There are no errors in the scheduler log. Here is some logs where things went wrong. I manually cleared one DAG with 3 tasks. 2 of the 3 tasks ran successufully, but task got stuck in the scheduled state. In the below log I only found information about the 2 tasks (database_setup, positions_extract ) that ran successfully.
[2024-06-07T01:12:52.113+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:13:52.199+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:14:52.284+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:417} INFO - 2 tasks up for execution:
<TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
<TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 0/16 running and queued tasks
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 1/16 running and queued tasks
[2024-06-07T01:15:37.977+0000] {scheduler_job_runner.py:596} INFO - Setting the following tasks to queued state:
<TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
<TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='database_setup', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.980+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'database_setup', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='positions_extract', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.981+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'positions_extract', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
What you think should happen instead?
No response
How to reproduce
I can easily reproduce it in my production cluster. But I cannot reproduce it in our dev cluster. Both clusters have almost exactly the same setup.
Operating System
Azure Kubernetes Service
Versions of Apache Airflow Providers
No response
Deployment
Official Apache Airflow Helm Chart
Deployment details
Using apache-airflow:2.9.1-python3.10 image
Anything else?
No response
Are you willing to submit PR?
- Yes I am willing to submit a PR!
Code of Conduct
- I agree to follow this project's Code of Conduct