
Some tasks stay forever in "queued" state #51569

@val-lavrentiev

Description


Apache Airflow version

3.0.2

If "Other Airflow 2 version" selected, which one?

No response

What happened?

A failed task, instead of being retried, stays forever in the "queued" state. According to the logs, the scheduler tries to schedule it (name: cdp_profiles_send_to_dest.run_export_json_and_send_to_destination) only once, for some unknown reason:

Jun 10 10:51:26 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:51:26.588+0000] {_client.py:1026} INFO - HTTP Request: PUT https://airflow-server.com/execution/task-instances/01975940-1bb5-7690-a48f-7b09540f8373/heartbeat "HTTP/1.1 204 No Cont>
Jun 10 10:51:31 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:51:31.498+0000] {scheduler_job_runner.py:2128} INFO - Adopting or resetting orphaned tasks for active dag runs
Jun 10 10:51:31 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:51:31.614+0000] {_client.py:1026} INFO - HTTP Request: PUT  https://airflow-server.com/execution/task-instances/01975940-1bb5-7690-a48f-7b09540f8373/heartbeat "HTTP/1.1 204 No Cont>
Jun 10 10:51:36 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:51:36.795+0000] {_client.py:1026} INFO - HTTP Request: PUT  https://airflow-server.com/execution/task-instances/01975940-1bb5-7690-a48f-7b09540f8373/heartbeat "HTTP/1.1 204 No Cont>
Jun 10 10:51:36 ip-10-0-0-77 start.sh[1369546]: scheduler     | 2025-06-10 10:51:36 [debug    ] Received message from task runner [supervisor] msg=RetryTask(state='up_for_retry', end_date=datetime.datetime(2025, 6, 10, 10, 51, 36, 773750, tzinfo=TzInfo(UTC)), rendered_m>
Jun 10 10:51:36 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:51:36.839+0000] {_client.py:1026} INFO - HTTP Request: PATCH  https://airflow-server.com/execution/task-instances/01975940-1bb5-7690-a48f-7b09540f8373/state "HTTP/1.1 204 No Conten>
Jun 10 10:51:36 ip-10-0-0-77 start.sh[1369546]: scheduler     | 2025-06-10 10:51:36 [debug    ] Received message from task runner [supervisor] msg=SetRenderedFields(rendered_fields={'op_args': [], 'op_kwargs': {}, 'bash_command': 'direnv allow /data/ephemeral/airflow-ap>
Jun 10 10:51:36 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:51:36.857+0000] {_client.py:1026} INFO - HTTP Request: PUT  https://airflow-server.com/execution/task-instances/01975940-1bb5-7690-a48f-7b09540f8373/rtif "HTTP/1.1 404 Not Found"
Jun 10 10:51:36 ip-10-0-0-77 start.sh[1369546]: scheduler     | 2025-06-10 10:51:36 [warning  ] Server error                   [airflow.sdk.api.client] detail=None
Jun 10 10:51:36 ip-10-0-0-77 start.sh[1369546]: scheduler     | 2025-06-10 10:51:36 [error    ] API server error               [supervisor] detail={'detail': 'Not Found'} message='Not Found' status_code=404
Jun 10 10:51:51 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:51:51.976+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:52:22 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:52:22.126+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:52:52 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:52:52.272+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:53:23 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:53:23.490+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:53:53 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:53:53.628+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:54:24 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:54:24.826+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:54:54 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:54:54.965+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:55:26 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:55:26.180+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:55:57 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:55:57.145+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:56:27 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:56:27.533+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:56:31 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:56:31.570+0000] {scheduler_job_runner.py:2128} INFO - Adopting or resetting orphaned tasks for active dag runs
Jun 10 10:56:49 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:56:49.831+0000] {scheduler_job_runner.py:450} INFO - 1 tasks up for execution:
Jun 10 10:56:49 ip-10-0-0-77 start.sh[1369546]: scheduler     |         <TaskInstance: cdp_profiles_send_to_dest.run_export_json_and_send_to_destination scheduled__2025-06-09T07:00:00+00:00 [scheduled]>
Jun 10 10:56:49 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:56:49.831+0000] {scheduler_job_runner.py:522} INFO - DAG cdp_profiles_send_to_dest has 0/16 running and queued tasks
Jun 10 10:56:49 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:56:49.832+0000] {scheduler_job_runner.py:661} INFO - Setting the following tasks to queued state:
Jun 10 10:56:49 ip-10-0-0-77 start.sh[1369546]: scheduler     |         <TaskInstance: cdp_profiles_send_to_dest.run_export_json_and_send_to_destination scheduled__2025-06-09T07:00:00+00:00 [scheduled]>
Jun 10 10:56:49 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:56:49.834+0000] {scheduler_job_runner.py:767} INFO - Trying to enqueue tasks: [<TaskInstance: cdp_profiles_send_to_dest.run_export_json_and_send_to_destination scheduled__2025-06-09T07:00:00+0>
Jun 10 10:56:58 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:56:58.093+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:57:29 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:57:29.518+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:57:59 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:57:59.976+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00
Jun 10 10:58:31 ip-10-0-0-77 start.sh[1369546]: scheduler     | [2025-06-10T10:58:31.508+0000] {dag.py:2509} INFO - Setting next_dagrun for content_classifier_evaluation to 2025-06-09 00:00:00+00:00, run_after=2025-06-11 00:00:00+00:00

After that, the logs only show "Setting next_dagrun..." entries and no further attempt to schedule the task.
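For what it's worth, I would expect the scheduler to eventually fail tasks that sit in "queued" past `[scheduler] task_queued_timeout` (600 seconds by default, if I read the config docs correctly), which does not seem to happen here. A minimal sketch of the check I expected, using simplified dicts in place of real TaskInstance rows (field names here are illustrative, not the actual ORM columns):

```python
from datetime import datetime, timedelta, timezone

# Assumed default for [scheduler] task_queued_timeout; check your own config.
TASK_QUEUED_TIMEOUT = timedelta(seconds=600)

def find_stuck_in_queued(task_instances, now=None):
    """Return records that have sat in 'queued' longer than the timeout.

    `task_instances` is a list of dicts with 'task_id', 'state', and
    'queued_dttm' keys -- a simplified stand-in for TaskInstance rows.
    """
    now = now or datetime.now(timezone.utc)
    return [
        ti for ti in task_instances
        if ti["state"] == "queued"
        and ti["queued_dttm"] is not None
        and now - ti["queued_dttm"] > TASK_QUEUED_TIMEOUT
    ]

# Example mirroring the timestamps from the logs above.
now = datetime(2025, 6, 10, 11, 30, tzinfo=timezone.utc)
tis = [
    {"task_id": "run_export_json_and_send_to_destination",
     "state": "queued",
     "queued_dttm": datetime(2025, 6, 10, 10, 56, 49, tzinfo=timezone.utc)},
    {"task_id": "healthy_task", "state": "running",
     "queued_dttm": datetime(2025, 6, 10, 11, 29, tzinfo=timezone.utc)},
]
print([ti["task_id"] for ti in find_stuck_in_queued(tis, now)])
# prints ['run_export_json_and_send_to_destination']
```

In our case the task had been queued for well over 600 seconds with no timeout ever firing.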

What you think should happen instead?

After a failure, the task should be retried.

How to reproduce

We run Airflow on EC2 behind an ELB (on a single machine for now), served under a path like https://airflow-server.com/airflow. We use the default parameters apart from:


export AIRFLOW_HOME={{ airflow_v2_home }}
export AIRFLOW__API__BASE_URL=http://my-host-ip:8080/airflow
export AIRFLOW__CORE__EXECUTOR=LocalExecutor
export AIRFLOW__CORE__LOAD_EXAMPLES=false
export AIRFLOW__CORE__PARALLELISM=32
export AIRFLOW__CORE__SIMPLE_AUTH_MANAGER_ALL_ADMINS=True
export AIRFLOW__CORE__DAGS_FOLDER=/apps/airflow/dags
export AIRFLOW__SECRETS__BACKEND=airflow.providers.amazon.aws.secrets.secrets_manager.SecretsManagerBackend
export AIRFLOW__SECRETS__BACKEND_KWARGS='{"region_name": "eu-west-1", "connections_prefix": "", "variables_prefix": "", "config_prefix": ""}'
export AIRFLOW__SCHEDULER__CREATE_CRON_DATA_INTERVALS=True
export AIRFLOW__DATABASE__SQL_ALCHEMY_CONN=postgresql://{{ airflow_v2_postgres_user }}:{{ airflow_v2_postgres_password }}@{{ airflow_v2_postgres_host }}/{{ airflow_v2_postgres_database }}
# workaround for airflow v3.0.1 to avoid scheduler jwt expiration errors
export AIRFLOW__API_AUTH__JWT_CLI_EXPIRATION_TIME=315360000
export AIRFLOW__API_AUTH__JWT_EXPIRATION_TIME=315360000
export AIRFLOW__EXECUTION_API__JWT_EXPIRATION_TIME=315360000
export AIRFLOW__SCHEDULER__ENABLE_HEALTH_CHECK=True

We also had an issue with JWT token expiration errors, which we have worked around for now with the longer expiration time settings above (we can perhaps raise a separate issue for this later).
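As a quick sanity check that the longer expiration actually took effect, the `exp` claim of any token the components exchange can be inspected with the standard library alone (no signature verification, purely for debugging):

```python
import base64
import json
from datetime import datetime, timezone

def jwt_expiry(token: str) -> datetime:
    """Decode a JWT payload without verifying it and return its 'exp' claim."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return datetime.fromtimestamp(payload["exp"], tz=timezone.utc)
```

With the 315360000-second (about ten years) setting applied, the decoded `exp` should land roughly a decade in the future.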

Installation script on the EC2 instance (ARM, Ubuntu):

    . /etc/profile.d/nix.sh \
      && . /etc/profile.d/Z50-devbox.sh \
      && devbox global add uv overmind direnv s5cmd pixi \
      && mkdir -p /apps/airflow/apps \
      && mkdir -p /apps/airflow/dags \
      && uv venv /apps/airflow/.venv --python 3.11 \
      && uv pip install --python {{ airflow_v2_home }}/.venv/bin/python \
           --constraint https://github.com/apache/airflow/constraints-3.0.1/constraints-3.11.txt \
           'apache-airflow[amazon,slack,standard]' \
           asyncpg \
           psycopg2-binary
Operating System

Ubuntu 22.04.5 LTS

Versions of Apache Airflow Providers

No response

Deployment

PyPi

Deployment details

Database was upgraded from version 2.10.4.

Anything else?

No response

Are you willing to submit PR?

  • Yes I am willing to submit a PR!
