Remove caching for attention masks #2117
Conversation
torchtitan/models/attention.py
Outdated
```python
    return (kv_idx <= q_idx) & (q_idx - kv_idx < window_size)


# Use functools.partial to bind window_size while keeping the function cacheable
sliding_window_mod = functools.partial(_sliding_window_mask, window_size)
```
I might have missed this point -- calling partial a second time with the same window_size may not give a cache hit, and if so this approach won't work. Could you verify?
cc @fegin
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, I missed it. I checked the object id when calling the partial a second time with the same window_size, and the ids are different.
So I changed to using functools.lru_cache to do explicit caching, which uses window_size as the cache key. I verified the ids are the same. Wdyt about this solution?
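For reference, a minimal stdlib-only sketch of the identity issue discussed here. The predicate matches the diff above, while the factory name is made up for illustration: `functools.partial` builds a new object on every call, whereas an `lru_cache`-wrapped factory keyed on `window_size` returns the same object for repeated calls.

```python
import functools


def _sliding_window_mask(window_size, b, h, q_idx, kv_idx):
    # Same predicate as in the diff above.
    return (kv_idx <= q_idx) & (q_idx - kv_idx < window_size)


# partial: a fresh object every time, even with the same window_size.
first = functools.partial(_sliding_window_mask, 128)
second = functools.partial(_sliding_window_mask, 128)
assert first is not second  # different object ids, so no cache hit downstream


# lru_cache keyed on window_size: the same object comes back.
@functools.lru_cache(maxsize=4)
def make_sliding_window_mod(window_size):  # illustrative name, not torchtitan's
    return functools.partial(_sliding_window_mask, window_size)


assert make_sliding_window_mod(128) is make_sliding_window_mod(128)
```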
torchtitan/models/attention.py
Outdated
```python
    return blocked_mask_mod


@functools.lru_cache(4)
```
I'm worried that for the only use case
https://github.com/pytorch/torchtitan/blob/main/torchtitan/experiments/gpt_oss/model/model.py#L364
and_masks will be called for each iteration, and its result will have a different object id on each call (at the beginning of each layer).
In general, the caching mechanism around masks doesn't sound very robust. The per-iteration overhead might be fine if caching is removed. Should we remove all caching altogether? cc @fegin
I see, looks like we are using and_masks() for almost all models now.
If `and_masks` returns a new object id for each call, then the lru_cache around `create_attention_mask` will always miss (because the id returned from `and_masks` will be part of the cache key).
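A minimal sketch of why that cache can never hit, using simplified stand-ins for `and_masks` and `create_attention_mask` (the real signatures in PyTorch/torchtitan differ): because the freshly built mask_mod is part of the cache key, every call is a miss.

```python
import functools


def causal(b, h, q_idx, kv_idx):
    return kv_idx <= q_idx


def and_masks_standin(*mask_mods):
    # Mimics flex attention's and_masks: returns a *new* callable on every call.
    def combined(b, h, q_idx, kv_idx):
        result = True
        for mod in mask_mods:
            result = result & mod(b, h, q_idx, kv_idx)
        return result

    return combined


@functools.lru_cache(4)
def create_attention_mask(mask_mod, batch, seq_len):
    # Stand-in: the real function builds a BlockMask for flex attention.
    return ("block_mask", mask_mod, batch, seq_len)


for _ in range(3):
    # A new combined mask_mod each iteration means a new cache key each time.
    create_attention_mask(and_masks_standin(causal), 8, 1024)

print(create_attention_mask.cache_info())  # hits stay at 0; misses grow
```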
That's my guess. Could you verify? If so, I'd recommend we remove all the lru_cache annotations altogether for better readability and some memory savings (because we no longer maintain a cache that never gets any hits).
Yes, I already verified that and_masks returns a different object id every time. I will move towards removing all the cache annotations.
tianyu-l left a comment:
LGTM, one nit
Yes, we should remove the lru_cache. It was added before we used and_masks, in the hope that we could cache some masks (e.g., the causal mask). This is very hard to achieve at the API level. It may be better to let users decide what to cache.
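A hedged sketch of what "let users decide what to cache" could look like at the call site; everything below (the `build_mask` stub and the cache shape) is illustrative rather than torchtitan code. The caller rebuilds the mask only when the shape it depends on changes:

```python
def build_mask(mask_mod, batch, seq_len):
    # Illustrative stand-in for create_attention_mask, which really returns a BlockMask.
    return ("block_mask", mask_mod, batch, seq_len)


def causal(b, h, q_idx, kv_idx):
    return kv_idx <= q_idx


class AttentionWithCallerCache:
    """Caller-owned caching: rebuild the mask only when its key changes."""

    def __init__(self):
        self._key = None
        self._mask = None

    def get_mask(self, batch, seq_len):
        key = (batch, seq_len)
        if key != self._key:
            self._mask = build_mask(causal, batch, seq_len)
            self._key = key
        return self._mask


attn = AttentionWithCallerCache()
assert attn.get_mask(8, 1024) is attn.get_mask(8, 1024)  # reused for the same shape
```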
We remove the lru_cache for attention masks because, in the `get_attention_mask()` function, `and_masks(*mask_mods)` returns a different object id on every call. `create_attention_mask` uses all of its parameters as the cache key, so the new object id always causes a cache miss.

Before the change (llama3 debugmodel_flex_attn):

After the change:
