Fix API server memory leak: bound DBDagBag version cache with LRU eviction #64326

Open
dheerajturaga wants to merge 7 commits into apache:main from dheerajturaga:bugfix/api-memory-leak

Conversation

@dheerajturaga
Member

@dheerajturaga dheerajturaga commented Mar 27, 2026

DBDagBag._dags is an unbounded in-memory cache causing steady memory
growth in the API server.

DBDagBag was designed for the scheduler, which works with a bounded set
of currently-active DAG versions. As an API server singleton, it is exposed to
the full history of DAG versions in the database with no bound on how
many it will cache.

Replace the plain dict in DBDagBag._dags with an OrderedDict-based LRU
cache. Previously, in long-running API server processes, every unique
dag_version_id accessed was inserted and never evicted, causing unbounded
RSS growth (observed: 9.4 GiB after 7 days with ~70k DAG versions in DB).

The scheduler and API server have different access patterns, so the cache
policy is now split:

  • API server: bounded LRU cache, capped at 4096 entries by default
    (configurable via api.dag_version_cache_size). Cache hits promote the
    entry to MRU so frequently-accessed versions are retained over stale
    historical ones.
  • Scheduler: explicitly unbounded (max_cache_size=None). Its working
    set is naturally capped at one version per active DAG, so a size limit
    would add eviction overhead with no benefit.
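
A minimal sketch of the policy described above, assuming a standalone class rather than the actual DBDagBag code. The names `LRUCache` and `loader` are hypothetical and only illustrate how an OrderedDict can serve both the bounded (API server) and unbounded (scheduler) cases:

```python
from collections import OrderedDict
from typing import Any, Callable, Optional


class LRUCache:
    """Hypothetical stand-in for the cache behind DBDagBag._dags."""

    def __init__(self, max_size: Optional[int] = None) -> None:
        # max_size=None means "unbounded", as described for the scheduler.
        self.max_size = max_size
        self._data: "OrderedDict[Any, Any]" = OrderedDict()

    def get(self, key: Any, loader: Callable[[Any], Any]) -> Any:
        if key in self._data:
            # Cache hit: promote the entry to most-recently-used.
            self._data.move_to_end(key)
            return self._data[key]
        # Cache miss: load the value (a DB fetch in the real system).
        value = loader(key)
        self._data[key] = value
        if self.max_size is not None and len(self._data) > self.max_size:
            # Evict the least-recently-used entry (front of the OrderedDict).
            self._data.popitem(last=False)
        return value

    def __contains__(self, key: Any) -> bool:
        return key in self._data

    def __len__(self) -> int:
        return len(self._data)


# Bounded cache, as for the API server (tiny size for illustration).
api_cache = LRUCache(max_size=2)
api_cache.get("v1", loader=str.upper)
api_cache.get("v2", loader=str.upper)
api_cache.get("v1", loader=str.upper)  # hit: v1 becomes most recently used
api_cache.get("v3", loader=str.upper)  # miss: v2 is now the LRU entry and is evicted

# Unbounded cache, as for the scheduler.
scheduler_cache = LRUCache(max_size=None)
for i in range(100):
    scheduler_cache.get(i, loader=lambda k: k)
```

In this sketch the hit on `v1` is what keeps it alive when `v3` arrives, which is the "promote to MRU" behaviour the description refers to.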

Add Stats metrics to make the cache observable:
dag_bag.cache.hits, dag_bag.cache.misses, dag_bag.cache.evictions,
dag_bag.cache.size — emitted to the configured StatsD/OTEL backend,
no-op if metrics are not configured.
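
As a rough illustration of where these four metrics could be emitted, here is a sketch that uses a hypothetical in-memory recorder (`InMemoryStats`) in place of Airflow's real Stats backend. Only the metric names come from this PR; the class names and method shapes are assumptions made for the example:

```python
from collections import Counter, OrderedDict
from typing import Any, Callable, Dict


class InMemoryStats:
    """Hypothetical recorder standing in for a StatsD/OTEL backend."""

    def __init__(self) -> None:
        self.counters: Counter = Counter()
        self.gauges: Dict[str, int] = {}

    def incr(self, name: str, count: int = 1) -> None:
        self.counters[name] += count

    def gauge(self, name: str, value: int) -> None:
        self.gauges[name] = value


class InstrumentedLRU:
    """Hypothetical bounded LRU that reports the metrics named above."""

    def __init__(self, max_size: int, stats: InMemoryStats) -> None:
        self.max_size = max_size
        self.stats = stats
        self._data: "OrderedDict[Any, Any]" = OrderedDict()

    def get(self, key: Any, loader: Callable[[Any], Any]) -> Any:
        if key in self._data:
            self._data.move_to_end(key)
            self.stats.incr("dag_bag.cache.hits")
            return self._data[key]
        self.stats.incr("dag_bag.cache.misses")
        value = loader(key)
        self._data[key] = value
        if len(self._data) > self.max_size:
            self._data.popitem(last=False)
            self.stats.incr("dag_bag.cache.evictions")
        # The size only changes on a miss, so the gauge is updated here.
        self.stats.gauge("dag_bag.cache.size", len(self._data))
        return value


stats = InMemoryStats()
cache = InstrumentedLRU(max_size=2, stats=stats)
for key in ["a", "b", "a", "c"]:
    cache.get(key, loader=str.upper)
```

With this access pattern the recorder ends up with one hit, three misses, one eviction, and a size gauge of 2.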

Was generative AI tooling used to co-author this PR?
  • Yes (please specify the tool below)
    ClaudeCode

@boring-cyborg boring-cyborg bot added area:API Airflow's REST/HTTP API area:ConfigTemplates labels Mar 27, 2026
@eladkal eladkal added this to the Airflow 3.2.0 milestone Mar 27, 2026
@eladkal eladkal added the type:bug-fix Changelog: Bug Fixes label Mar 27, 2026
@shivaam
Contributor

shivaam commented Mar 28, 2026

Nice. Seems like a real production bug. A few thoughts:

  1. Default of 512 may be too low. The scheduler processes all active DAGs every cycle. With 1000+ DAGs, a 512 cache means constant eviction and re-fetching from the DB on every loop. The API server's Execution API also serves worker requests for every task state transition, so it can accumulate entries fast too. Consider starting higher (2048+) and letting people tune down — it's easier to reduce a known number than to discover you need to increase one you didn't know existed.
  2. A single config for both scheduler and API server may not be ideal. The scheduler's working set is bounded (latest version per active DAG) and performance-sensitive, so it needs a cache big enough to hold all active DAGs. There are also no metrics for the cache, which will make debugging harder.
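
The eviction concern in point 1 can be illustrated with a small simulation. This is a sketch under a stated assumption: the working set is scanned in the same order on every loop, which is a simplification and not a measurement of the real scheduler. The function name `simulate_cyclic_scan` is hypothetical:

```python
from collections import OrderedDict
from typing import Tuple


def simulate_cyclic_scan(cache_size: int, n_keys: int, passes: int) -> Tuple[int, int]:
    """Return (hits, misses) for an LRU cache scanned cyclically over n_keys."""
    cache: "OrderedDict[int, None]" = OrderedDict()
    hits = misses = 0
    for _ in range(passes):
        for key in range(n_keys):
            if key in cache:
                cache.move_to_end(key)
                hits += 1
            else:
                misses += 1
                cache[key] = None
                if len(cache) > cache_size:
                    # Evict the least-recently-used key.
                    cache.popitem(last=False)
    return hits, misses


# 1000 keys scanned three times with a cache smaller than the working set:
# every key is evicted before it is seen again, so there are no hits at all.
small = simulate_cyclic_scan(cache_size=512, n_keys=1000, passes=3)

# Same scan with a cache larger than the working set:
# only the first pass misses.
large = simulate_cyclic_scan(cache_size=2048, n_keys=1000, passes=3)

print(small)  # (0, 3000)
print(large)  # (2000, 1000)
```

Under this access pattern an LRU cache smaller than the working set gives no benefit, which is the reason a cache that must hold all active DAGs needs to be sized above that count or left unbounded.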

@dheerajturaga
Member Author

Done! The scheduler is no longer bound by the cache limit. Only the API server has a configurable cache size. I also added metrics to track cache behavior.

@dheerajturaga dheerajturaga added the ready for maintainer review Set after triaging when all criteria pass. label Mar 31, 2026
@kaxil kaxil requested a review from Copilot April 2, 2026 00:42
Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

Contributor

@jscheffl jscheffl left a comment

LGTM for 3.2.1!

@jscheffl jscheffl added the backport-to-v3-2-test Mark PR with this label to backport to v3-2-test branch label Apr 3, 2026
@shivaam
Contributor

shivaam commented Apr 5, 2026

FYI @kaxil — this overlaps with your #60804, which tackles the same DBDagBag._dags memory growth with a similar LRU approach. Wanted to make sure you both were aware of each other's PRs.

dheerajturaga and others added 3 commits April 6, 2026 12:49
…ction

  Replace the plain dict in DBDagBag._dags with a bounded OrderedDict-based
  LRU cache. In long-running API server processes, every unique dag_version_id
  accessed is inserted and never evicted, causing unbounded RSS growth (observed:
  9.4 GiB after 7 days with ~70k DAG versions in DB).

  The cache is now capped at 512 entries by default (configurable via
  core.max_dag_version_cache_size). Cache hits promote the entry to MRU so
  frequently-accessed versions are retained over stale historical ones.
Co-authored-by: Elad Kalif <45845474+eladkal@users.noreply.github.com>
dheerajturaga and others added 2 commits April 6, 2026 12:49
  Replace the unbounded DBDagBag._dags dict with an OrderedDict-based LRU
  cache. In long-running API server processes every unique dag_version_id was
  inserted and never evicted, causing unbounded RSS growth (observed: 9.4 GiB
  after 7 days with ~70k DAG versions in DB).

  - API server: bounded LRU cache, size controlled by api.dag_version_cache_size
    (default 4096). Config lives in [api] because only the API server accumulates
    historical versions.
  - Scheduler: explicitly unbounded (max_cache_size=None). Its working set is
    naturally capped at one version per active DAG, so a size limit would add
    eviction overhead with no benefit.
  - Add Stats metrics: dag_bag.cache.hits, dag_bag.cache.misses,
    dag_bag.cache.evictions, dag_bag.cache.size (sampled at 10%) — emitted to
    the configured StatsD/OTEL backend, no-op if metrics are not configured.
Co-authored-by: Jens Scheffler <95105677+jscheffl@users.noreply.github.com>
@dheerajturaga dheerajturaga force-pushed the bugfix/api-memory-leak branch from 9712e2b to 6942523 Compare April 6, 2026 17:49
Member

@pierrejeambrun pierrejeambrun left a comment

Compared to Kaxil's version, I believe this implementation isn't thread-safe, which is a problem. Can we fill that gap?

Otherwise looking good to me.
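
One possible way to close that gap, sketched here under assumptions: a `threading.Lock` guards every read and write of the OrderedDict, and the load runs outside the lock. The class name `ThreadSafeLRU` is hypothetical, and this is not taken from either PR's actual code:

```python
import threading
from collections import OrderedDict
from typing import Any, Callable, List


class ThreadSafeLRU:
    """Hypothetical lock-protected LRU cache."""

    def __init__(self, max_size: int) -> None:
        self.max_size = max_size
        self._data: "OrderedDict[Any, Any]" = OrderedDict()
        self._lock = threading.Lock()

    def get(self, key: Any, loader: Callable[[Any], Any]) -> Any:
        with self._lock:
            if key in self._data:
                self._data.move_to_end(key)
                return self._data[key]
        # Load outside the lock so a slow fetch does not block other threads.
        value = loader(key)
        with self._lock:
            if key in self._data:
                # Another thread inserted the key in the meantime; reuse its value.
                self._data.move_to_end(key)
                return self._data[key]
            self._data[key] = value
            while len(self._data) > self.max_size:
                self._data.popitem(last=False)
            return value

    def __len__(self) -> int:
        with self._lock:
            return len(self._data)


cache = ThreadSafeLRU(max_size=50)
errors: List[Exception] = []


def worker() -> None:
    try:
        for key in range(200):
            cache.get(key, loader=lambda k: k * 2)
    except Exception as exc:  # collect failures for inspection
        errors.append(exc)


threads = [threading.Thread(target=worker) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

The trade-off in this sketch is that two threads may load the same key concurrently and one result is discarded. Holding the lock across the load would avoid the duplicate work at the cost of serializing all misses.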

@pierrejeambrun
Member

pierrejeambrun commented Apr 7, 2026

cc: @kaxil in case you want to take a look :)

Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 6 out of 6 changed files in this pull request and generated 4 comments.

Labels

  • area:API (Airflow's REST/HTTP API)
  • area:ConfigTemplates
  • backport-to-v3-2-test (Mark PR with this label to backport to v3-2-test branch)
  • ready for maintainer review (Set after triaging when all criteria pass.)
  • type:bug-fix (Changelog: Bug Fixes)

7 participants