Tasks Stuck at Scheduled State #40106

@hbc-acai

Description

Apache Airflow version

2.9.1

If "Other Airflow 2 version" selected, which one?

No response

What happened?

After upgrading to 2.9.1, tasks get stuck in the scheduled state about one hour after the scheduler starts. During the first hour, all tasks run fine. When I restarted the scheduler, it moved the "stuck" task instances to the queued state and ran them, but new tasks got stuck again after one hour.

This is reproducible in my production cluster: it happens every time we restart the scheduler. However, we are not able to replicate it in our dev cluster.

There are no errors in the scheduler log. Here are some logs from when things went wrong. I manually cleared one DAG with 3 tasks. 2 of the 3 tasks ran successfully, but the third task got stuck in the scheduled state. In the log below, I only found information about the 2 tasks (database_setup, positions_extract) that ran successfully.

[2024-06-07T01:12:52.113+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:13:52.199+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:14:52.284+0000] {kubernetes_executor.py:240} INFO - Found 0 queued task instances
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:417} INFO - 2 tasks up for execution:
        <TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
        <TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 0/16 running and queued tasks
[2024-06-07T01:15:37.976+0000] {scheduler_job_runner.py:480} INFO - DAG update_risk_exposure_store has 1/16 running and queued tasks
[2024-06-07T01:15:37.977+0000] {scheduler_job_runner.py:596} INFO - Setting the following tasks to queued state:
        <TaskInstance: update_risk_exposure_store.database_setup scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
        <TaskInstance: update_risk_exposure_store.positions_extract scheduled__2024-05-28T10:10:00+00:00 [scheduled]>
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='database_setup', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.980+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'database_setup', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
[2024-06-07T01:15:37.980+0000] {scheduler_job_runner.py:639} INFO - Sending TaskInstanceKey(dag_id='update_risk_exposure_store', task_id='positions_extract', run_id='scheduled__2024-05-28T10:10:00+00:00', try_number=3, map_index=-1) to executor with priority 25 and queue default
[2024-06-07T01:15:37.981+0000] {base_executor.py:149} INFO - Adding to queue: ['airflow', 'tasks', 'run', 'update_risk_exposure_store', 'positions_extract', 'scheduled__2024-05-28T10:10:00+00:00', '--local', '--subdir', 'DAGS_FOLDER/update_risk_exposure_store.py']
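
To check which task instances of a DAG run never appear in a "Sending TaskInstanceKey(...)" line, the scheduler log can be scanned with a short script. This is only a minimal sketch: the regular expression is written against the log format shown above, and the function names and the `expected_task_ids` argument are hypothetical helpers, not part of Airflow.

```python
import re

# Matches the "Sending TaskInstanceKey(...)" lines in the scheduler log excerpt above.
# Assumption: the line format is the one shown in this report (Airflow 2.9.1).
SENT_RE = re.compile(
    r"Sending TaskInstanceKey\(dag_id='(?P<dag_id>[^']+)', "
    r"task_id='(?P<task_id>[^']+)', run_id='(?P<run_id>[^']+)'"
)


def tasks_sent_to_executor(log_text):
    """Return the set of (dag_id, task_id, run_id) tuples handed to the executor."""
    return {
        (m["dag_id"], m["task_id"], m["run_id"])
        for m in SENT_RE.finditer(log_text)
    }


def never_sent(log_text, dag_id, run_id, expected_task_ids):
    """Return the expected task ids (sorted) with no 'Sending' line for this DAG run."""
    sent = {
        task_id
        for d, task_id, r in tasks_sent_to_executor(log_text)
        if d == dag_id and r == run_id
    }
    return sorted(set(expected_task_ids) - sent)
```

Called with the DAG id `update_risk_exposure_store`, the run id from the log, and the full list of task ids in the DAG, `never_sent` would return only the task that the scheduler never sent to the executor.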

What you think should happen instead?

No response

How to reproduce

I can easily reproduce it in my production cluster. But I cannot reproduce it in our dev cluster. Both clusters have almost exactly the same setup.
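
Since the two clusters are only "almost" identical, one way to narrow down the difference is to diff their Airflow configuration. The sketch below assumes the configuration of each cluster has been exported as INI-formatted text (for example an `airflow.cfg` file); the function names are hypothetical. Settings supplied only through environment variables would not show up in a plain `airflow.cfg` and would need to be compared separately.

```python
import configparser


def load_cfg(text):
    """Parse INI-formatted config text into a {(section, key): value} dict."""
    # interpolation=None so that '%' characters in values are kept literally.
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {
        (section, key): value
        for section in parser.sections()
        for key, value in parser.items(section)
    }


def diff_cfg(prod_text, dev_text):
    """Return {(section, key): (prod_value, dev_value)} for every differing setting.

    A value of None means the setting is missing on that side.
    """
    prod, dev = load_cfg(prod_text), load_cfg(dev_text)
    return {
        key: (prod.get(key), dev.get(key))
        for key in sorted(set(prod) | set(dev))
        if prod.get(key) != dev.get(key)
    }
```

Any setting that appears in the result with different values, or is present on only one side, is a candidate for explaining why only the production cluster is affected.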

Operating System

Azure Kubernetes Service

Versions of Apache Airflow Providers

No response

Deployment

Official Apache Airflow Helm Chart

Deployment details

Using apache-airflow:2.9.1-python3.10 image

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
