HF Trainer: ALST/Ulysses sequence parallelism integration via HF Accelerate #41832

Merged: ArthurZucker merged 55 commits into huggingface:main from stas00:alst-integration on Nov 21, 2025
@stas00 stas00 commented Oct 23, 2025

Contributor

Integrates HF Accelerate's support for ALST/Ulysses sequence parallelism (huggingface/accelerate#3817) into HF Trainer.

TODO:

…lerate

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
stas00 commented Oct 23, 2025

Contributor Author

I have a hard time finding where I can add the documentation for this new backend. Context parallelism isn't documented anywhere in HF Trainer - how can users discover it? If I'm missing it, could you please point me to where I should extend the documentation? Thank you.

Same story with CP tests - there are none :( so need to figure out how to write some.

stas00 commented Oct 24, 2025

Contributor Author

OK, the next issue with the existing integration of CP/FSDP. What does the following mean?

 $ sometrainerscript.py --help
 [...]
 --parallelism_config PARALLELISM_CONFIG, --parallelism-config PARALLELISM_CONFIG

This arg tells users absolutely nothing about what value(s) to pass to --parallelism_config. If it isn't meant to be a CLI arg and can only be set explicitly in code, perhaps it shouldn't be listed in --help, or its help text should at least say that it has to be set in code?

@Rocketknight1

Member

cc @SunMarc

@stas00 stas00 marked this pull request as draft October 27, 2025 04:42
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Comment thread src/transformers/testing_utils.py
return env

-def get_auto_remove_tmp_dir(self, tmp_dir=None, before=None, after=None):
+def get_auto_remove_tmp_dir(self, tmp_dir=None, before=None, after=None, return_pathlib_obj=False):

@stas00 stas00 Oct 28, 2025

Contributor Author

This is a really old version. In the latest incarnation it always returns a Path object. But to keep BC, I added a new flag here instead; the tests are less clunky that way.

The latest version is here: https://github.com/stas00/ml-engineering/blob/master/testing/testing_utils.py

If you want, you could switch to the latest version instead and adapt the tests to simplify them.

Contributor Author

it's much better for it to always return a pathlib.Path object but you'd need to tweak a few tests which use this API.
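
A minimal sketch of the backward-compatible flag discussed here. This is not the actual `get_auto_remove_tmp_dir` implementation: `make_tmp_dir` and the `ds_config.json` file name are hypothetical, and the helper only creates a temporary directory. It illustrates how existing callers keep getting a `str` while new callers opt into a `pathlib.Path`:

```python
import shutil
import tempfile
from pathlib import Path
from typing import Union


def make_tmp_dir(return_pathlib_obj: bool = False) -> Union[str, Path]:
    """Hypothetical stand-in for a test helper that hands out a temp dir.

    Existing callers keep getting a `str`; new callers opt into `Path`.
    """
    path = tempfile.mkdtemp()
    return Path(path) if return_pathlib_obj else path


old_style = make_tmp_dir()  # str, as before
new_style = make_tmp_dir(return_pathlib_obj=True)  # pathlib.Path

# With a Path, tests can build sub-paths without os.path.join
# (the file name is made up for illustration)
config_file = new_style / "ds_config.json"

# Clean up the directories created for this demo
shutil.rmtree(old_style)
shutil.rmtree(new_style)
```

Always returning a `Path` would make the flag unnecessary, at the cost of touching every test that treats the result as a string.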

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
@stas00 stas00 marked this pull request as ready for review October 28, 2025 02:34
stas00 commented Oct 28, 2025

Contributor Author

@SunMarc, this is ready for a review. Tests fail because they need the accelerate PR huggingface/accelerate#3817

I just didn't know where to update docs since parallelism doesn't seem to be documented here at all. Please correct me if I'm wrong.

Thanks to @kashif for help with the test.

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

@SunMarc SunMarc left a comment

Member

Thanks for this clean integration! I left a couple of comments. @kashif @qgallouedec, it would be great if you could have a look at this PR so that we can also make it compatible with TRL.


Comment thread src/transformers/trainer.py Outdated
Comment thread src/transformers/trainer.py Outdated
Comment on lines +3865 to +3871
losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
# special dealing with SFT that has prompt tokens that aren't used in loss computation
good_tokens = (shift_labels != -100).view(-1).sum()
good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
total_loss = sum(losses_per_rank[rank] * good_tokens_per_rank[rank] for rank in range(sp_world_size))
total_good_tokens = sum(good_tokens_per_rank)
loss = total_loss / max(total_good_tokens, 1)

Member

We probably don't need to do this if num_items_in_batch is computed and passed to unwrapped_model.loss_function. num_items_in_batch was introduced to fix the gradient accumulation issue (https://unsloth.ai/blog/gradient). num_items_in_batch is basically total_good_tokens if grad_acc = 1.

Contributor Author

sorry, what is not needed?

This code block is there because we need to compute the correct loss across SP ranks. If you just average the per-rank losses, the result will be incorrect in the case of -100 masked tokens (SFT), since each rank is likely to process a different number of unmasked tokens (this is not DP averaging).

Unless what you mean is that we don't need to calculate total_good_tokens since num_items_in_batch is already that, but the rest of the code remains - did I understand you correctly?

@SunMarc SunMarc Nov 4, 2025

Member

If you pass num_items_in_batch to loss_function, it will sum the loss and then divide it by num_items_in_batch directly. This way, I think we don't actually need to recalculate the total_loss from the averaged losses and the good_tokens_per_rank. Maybe I'm wrong, so please correct me! But I think this might solve the grad acc issue. In any case, we will keep the current code, as not all models accept num_items_in_batch when calculating the loss.

total_loss = sum(losses_per_rank[rank] * good_tokens_per_rank[rank] for rank in range(sp_world_size))

def ForCausalLMLoss(
    logits,
    labels,
    vocab_size: int,
    num_items_in_batch: Optional[int] = None,
    ignore_index: int = -100,
    shift_labels: Optional[torch.Tensor] = None,
    **kwargs,
) -> torch.Tensor:
    # Upcast to float if we need to compute the loss to avoid potential precision issues
    logits = logits.float()

    if shift_labels is None:
        # Shift so that tokens < n predict n
        labels = nn.functional.pad(labels, (0, 1), value=ignore_index)
        shift_labels = labels[..., 1:].contiguous()

    # Flatten the tokens
    logits = logits.view(-1, vocab_size)
    shift_labels = shift_labels.view(-1)
    # Enable model parallelism
    shift_labels = shift_labels.to(logits.device)
    loss = fixed_cross_entropy(logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
    return loss


def fixed_cross_entropy(
    source: torch.Tensor,
    target: torch.Tensor,
    num_items_in_batch: Optional[int] = None,
    ignore_index: int = -100,
    **kwargs,
) -> torch.Tensor:
    reduction = "sum" if num_items_in_batch is not None else "mean"
    loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
    if reduction == "sum":
        loss = loss / num_items_in_batch
    return loss

@stas00 stas00 Nov 4, 2025

Contributor Author

If you pass num_items_in_batch you indeed don't need to do the local loss calculation, since it'll do that already. But we still need to calculate the loss distributed across ranks.

Here is an example. Let's take a 2k-token sample, SP-split across 2 ranks, using SFT:

  1. SP rank0 - 900 masked and 100 non-masked tokens (a long initial prompt that is -100 masked out)
  2. SP rank1 - 100 masked and 900 non-masked tokens

So each rank produces the correct loss if we use num_items_in_batch, but how do you combine the losses of the 2 ranks? A straight average will give a very skewed result, because rank0's loss covers 9x fewer non-masked tokens.

Let's take it to a more telling example:

  1. SP rank0 - 1000 masked and 0 non-masked tokens (a long initial prompt that is masked out)
  2. SP rank1 - 0 masked and 1000 non-masked tokens

Here rank0 can't contribute anything to the total loss. A plain average of the 2 losses would be completely broken: you'd be averaging with an undefined value, since the loss function will return a NaN or None for rank0.
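
The two examples can be checked with a single-process sketch. It uses plain Python floats instead of tensors and a differentiable `all_gather`, the per-rank loss values are made up, and `combine_sp_losses` is a hypothetical name. The `n > 0` guard mirrors the one in the Accelerate release-notes snippet quoted further down this page:

```python
import math


def combine_sp_losses(mean_losses, good_tokens):
    """Token-weighted combination of per-rank mean losses.

    mean_losses[i] is rank i's mean loss over its non-masked tokens
    (may be NaN when the rank has none); good_tokens[i] is that count.
    """
    total_loss = sum(
        loss * n for loss, n in zip(mean_losses, good_tokens) if n > 0
    )
    total_good_tokens = sum(good_tokens)
    return total_loss / max(total_good_tokens, 1)


# First example: rank0 has 100 non-masked tokens, rank1 has 900.
# The per-rank mean losses (2.0 and 1.0) are made-up numbers.
losses, tokens = [2.0, 1.0], [100, 900]
straight_average = sum(losses) / len(losses)  # treats both ranks equally
weighted = combine_sp_losses(losses, tokens)  # (2.0*100 + 1.0*900) / 1000

# Second example: rank0 has no non-masked tokens, so its mean loss is NaN.
losses, tokens = [float("nan"), 1.0], [0, 1000]
broken = sum(losses) / len(losses)            # NaN poisons the plain average
fixed = combine_sp_losses(losses, tokens)     # rank0 is skipped entirely
```

In the first case the straight average over-weights rank0's 100 tokens; in the second it is not even a number, while the weighted form falls back to rank1's loss.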

Member

So each rank produces the correct loss if we use num_items_in_batch, but how do you combine the losses of the 2 ranks? A straight average will give a very skewed result, because rank0's loss covers 9x fewer non-masked tokens.

Both losses have the same denominator, num_items_in_batch, so the value of each loss already takes into account the number of non-masked tokens, as we use reduction = "sum". So we just sum them to get the final loss. In your first example, num_items_in_batch will be equal to 1000. For rank0, the loss will be equal to (l1+...+l100)/1000 and for rank1, it will be (l1+...+l900)/1000.
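
A small numeric sketch of this argument, assuming `num_items_in_batch` is the count of non-masked tokens across all SP ranks (not the local count). The per-token losses are made-up values and everything runs in one process, so it says nothing about how the cross-rank sum is performed or whether it is differentiable:

```python
import math

# Made-up per-token losses for the first example:
# rank0 holds 100 non-masked tokens, rank1 holds 900.
rank0_token_losses = [2.0] * 100
rank1_token_losses = [1.0] * 900

# Global count of non-masked tokens across both SP ranks (assumption)
num_items_in_batch = len(rank0_token_losses) + len(rank1_token_losses)

# reduction="sum" followed by division by the shared denominator
rank0_loss = sum(rank0_token_losses) / num_items_in_batch
rank1_loss = sum(rank1_token_losses) / num_items_in_batch

# Because the denominator is shared, a plain sum recovers the global mean
summed = rank0_loss + rank1_loss

# Reference: mean over all non-masked tokens
global_mean = sum(rank0_token_losses + rank1_token_losses) / num_items_in_batch
assert math.isclose(summed, global_mean)
```

If each rank instead divided by its own local token count, the sum would no longer equal the global mean, which is the case the token-weighted code in the PR handles.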

Contributor Author

I have a feeling we are missing each other. I'm talking about differentiable loss combination across ranks and I think you're talking about the local rank's loss.

Could you please point me to the code in HF Trainer that performs a differentiable loss combination across multiple ranks? I couldn't find any.

Comment thread src/transformers/trainer.py Outdated
Comment thread src/transformers/trainer.py Outdated
kashif commented Nov 4, 2025

Contributor

Thanks @SunMarc. I did an initial run with default settings in TRL and these branches, and it all worked nicely (apart from model saving, due to my not updating deepspeed, I think). I will check the config that uses a bespoke compute_loss in the SFT trainer, thanks for the heads up!

SunMarc commented Nov 4, 2025

Member

@SunMarc, this is ready for a review. Tests fail because they need the accelerate PR huggingface/accelerate#3817

I just didn't know where to update docs since parallelism doesn't seem to be documented here at all. Please correct me if I'm wrong.

Thanks to @kashif for help with the test.

We indeed do not have docs related to that. @kashif, as you added CP support in Trainer, would you be willing to add some docs around that?
@stas00, I think we can either update the deepspeed docs https://huggingface.co/docs/transformers/main/en/deepspeed and/or create a new doc called contextparallel, like in accelerate.

@HuggingFaceDocBuilderDev


The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

kashif commented Nov 4, 2025

Contributor

@SunMarc docs on TRL side: huggingface/trl#4420

kashif commented Nov 4, 2025

Contributor

i can look at the docs on the accelerate side as well...

stas00 commented Nov 4, 2025

Contributor Author

@stas00, I think we can either update the deepspeed docs https://huggingface.co/docs/transformers/main/en/deepspeed and/or create a new doc called contextparallel, like in accelerate.

Ideally we would have a dedicated doc like you suggested, which could then link into deepspeed for nuances as one way to do that. The key is for the user to quickly understand what's possible, thus a single context parallel entry point doc would be very useful to users.

sfc-gh-sbekman and others added 6 commits November 5, 2025 03:38
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Comment thread src/transformers/trainer.py Outdated
zhangwj618 commented Nov 20, 2025


Is there a mismatch between the docs and the code? docs/source/en/deepspeed.md says:
"By default, when you only configure sp_size, DP is automatically calculated as dp_size = world_size / sp_size."
However, when I run the code with sp_size != world_size, I get an error unless I specify dp_replicate_size manually.
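
For illustration, the value that the quoted doc sentence describes can be computed by hand and passed explicitly as dp_replicate_size. This is a sketch only: `derive_dp_replicate_size` is a hypothetical helper, not part of Transformers or Accelerate, and it assumes DP-replicate and SP are the only parallel dimensions in use:

```python
def derive_dp_replicate_size(world_size: int, sp_size: int) -> int:
    """Hypothetical helper: DP-replicate size implied by world_size and sp_size.

    Assumes DP-replicate and SP are the only parallel dimensions in use.
    """
    if sp_size < 1 or world_size % sp_size != 0:
        raise ValueError(
            f"world_size={world_size} must be a positive multiple of sp_size={sp_size}"
        )
    return world_size // sp_size


# 8 processes with sp_size=2 gives 4 data-parallel replicas of 2 SP ranks each
dp_replicate_size = derive_dp_replicate_size(world_size=8, sp_size=2)
```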

kashif commented Nov 20, 2025

Contributor

@zhangwj618 the doc is wrong... my bad let me fix it!

Comment thread docs/source/en/deepspeed.md
Comment thread docs/source/en/deepspeed.md
Comment thread docs/source/en/deepspeed.md Outdated
Comment thread src/transformers/training_args.py Outdated
Comment thread src/transformers/trainer.py Outdated
kashif and others added 3 commits November 20, 2025 17:28
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
Comment thread src/transformers/trainer.py Outdated
Comment thread docs/source/en/deepspeed.md Outdated
@ArthurZucker ArthurZucker merged commit 7e0ea69 into huggingface:main Nov 21, 2025
15 of 21 checks passed
@ArthurZucker

Collaborator

Thanks a lot @stas00 for your work 🤗

@stas00 stas00 deleted the alst-integration branch November 21, 2025 17:39
stas00 commented Nov 21, 2025

Contributor Author

super! Thanks a lot to Marc and Kashif for help with integration and Weijie Zhang for being the first early adopter!

SangbumChoi pushed a commit to SangbumChoi/transformers that referenced this pull request Jan 23, 2026
…lerate (huggingface#41832)

* HF Trainer: ALST/Ulysses sequence parallelism integration via HF Accelerate

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* make it work + tests

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* cleanup

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* undo

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* normalize

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* always return cp_size

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* cleanup

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* extract code into _deepspeed_cp_compute_loss

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* fix

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* ALST/Ulysses sequence parallelism docs

* typo

* add link to UlyssesSPDataLoaderAdapter

* adapt to renaming to SP

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* improve

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* fix

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* Update docs/source/en/deepspeed.md

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* address comments

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* address comments

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* Update src/transformers/trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* address comments

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* address comments

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* Update src/transformers/trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update src/transformers/trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* style

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

* Update docs/source/en/deepspeed.md

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update docs/source/en/deepspeed.md

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Account for Sequence Parallelism (SP) dataloader adapter effect

* Update src/transformers/trainer.py

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update docs/source/en/deepspeed.md

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* Update docs/source/en/deepspeed.md

Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>

* model_accepts_loss_kwargs to False

* better comment

* Apply suggestion from @kashif

* Apply suggestion from @kashif

* Apply suggestions from code review

* Apply suggestion from @kashif

* Apply suggestion from @kashif

* Apply suggestion from @kashif

* Update src/transformers/trainer.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Update src/transformers/training_args.py

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

* Apply suggestion from @kashif

* Apply suggestion from @kashif

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
aar-public-version-bump-bot Bot added a commit to Aleph-Alpha-Research/eval-framework that referenced this pull request Jun 3, 2026
> ℹ️ **Note**
> 
> This PR body was truncated due to platform limits.

This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [accelerate](https://redirect.github.com/huggingface/accelerate) | `>=0.34.2,<1` → `>=1.13.0,<2` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/accelerate/1.13.0?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/accelerate/0.34.2/1.13.0?slim=true) |

---

### Release Notes

<details>
<summary>huggingface/accelerate (accelerate)</summary>

### [`v1.13.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.13.0): Neuron support, IPEX removal, and distributed training fixes

[Compare Source](https://redirect.github.com/huggingface/accelerate/compare/v1.12.0...v1.13.0)

#### AWS Neuron support

We now have support for AWS Neuron (Trainium/Inferentia) devices. Thanks
[@&#8203;michaelbenayoun](https://redirect.github.com/michaelbenayoun)
for adding this.

- Neuron integration by
[@&#8203;michaelbenayoun](https://redirect.github.com/michaelbenayoun)
in
[#&#8203;3935](https://redirect.github.com/huggingface/accelerate/pull/3935)

##### XPU Improvements

We've removed IPEX dependency and improved device-agnostic code for XPU.

- using spawn instead of fork for XPU device by
[@&#8203;kaixuanliu](https://redirect.github.com/kaixuanliu) in
[#&#8203;3884](https://redirect.github.com/huggingface/accelerate/pull/3884)
- Remove ipex by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3883](https://redirect.github.com/huggingface/accelerate/pull/3883)
- enhance new codes to XPU, and make them be device agnostic by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3890](https://redirect.github.com/huggingface/accelerate/pull/3890)
- Fix KMP\_AFFINITY incorrectly set for non-CPU training by
[@&#8203;hexfaker](https://redirect.github.com/hexfaker) in
[#&#8203;3912](https://redirect.github.com/huggingface/accelerate/pull/3912)

#### FSDP2 Improvements

We've added a bunch of important fixes for FSDP2 users: upcasting only
grad-requiring params, better tied embedding errors, DCP optimizer
loading, bf16 optimizer step crash fix, and torch < 2.7.0 compatibility.

- Upcast FSDP2 parameters only if requires\_grad by
[@&#8203;ojh31](https://redirect.github.com/ojh31) in
[#&#8203;3848](https://redirect.github.com/huggingface/accelerate/pull/3848)
- Fix FSDP2 tied embedding errors with targeted ValueError guidance by
[@&#8203;amanzoni1](https://redirect.github.com/amanzoni1) in
[#&#8203;3878](https://redirect.github.com/huggingface/accelerate/pull/3878)
- bug: fsdp cannot load optimizer state using dcp by
[@&#8203;flymin](https://redirect.github.com/flymin) in
[#&#8203;3904](https://redirect.github.com/huggingface/accelerate/pull/3904)
- fix crash in optimizer.step when fsdp2 is enabled and model is
bfloat16 by [@&#8203;sywangyi](https://redirect.github.com/sywangyi) in
[#&#8203;3905](https://redirect.github.com/huggingface/accelerate/pull/3905)
- Fix FSDP2 crash with ignored\_params on torch < 2.7.0 by
[@&#8203;Mr-Neutr0n](https://redirect.github.com/Mr-Neutr0n) in
[#&#8203;3924](https://redirect.github.com/huggingface/accelerate/pull/3924)

#### DeepSpeed Sequence Parallelism

We've added several fixes to the DeepSpeed + Sequence Parallelism
integration introduced in v1.12.0, including evaluation support during
SP training and proper process group handling.

- \[SP] fix loss computation example by
[@&#8203;kashif](https://redirect.github.com/kashif) in
[#&#8203;3858](https://redirect.github.com/huggingface/accelerate/pull/3858)
- \[SP and CP] error out if both CP and SP enabled by
[@&#8203;kashif](https://redirect.github.com/kashif) in
[#&#8203;3862](https://redirect.github.com/huggingface/accelerate/pull/3862)
- DeepSpeed has its own process group by
[@&#8203;kashif](https://redirect.github.com/kashif) in
[#&#8203;3916](https://redirect.github.com/huggingface/accelerate/pull/3916)
- \[Deepspeed] skip device mesh creation when deepspeed and sp\_size >1
by [@&#8203;kashif](https://redirect.github.com/kashif) in
[#&#8203;3914](https://redirect.github.com/huggingface/accelerate/pull/3914)
- Enable evaluation during deepspeed Sequence Parallel by
[@&#8203;jp1924](https://redirect.github.com/jp1924) in
[#&#8203;3917](https://redirect.github.com/huggingface/accelerate/pull/3917)

##### FP8

We've enhanced FP8 training. Thanks
[@&#8203;shimizust](https://redirect.github.com/shimizust) for fixing
torchao support.

- Fix FP8 torchao default config with padding and FSDP2 all-gather
support by [@&#8203;shimizust](https://redirect.github.com/shimizust) in
[#&#8203;3831](https://redirect.github.com/huggingface/accelerate/pull/3831)
- Fix execution with Transformer Engine by
[@&#8203;ksivaman](https://redirect.github.com/ksivaman) in
[#&#8203;3852](https://redirect.github.com/huggingface/accelerate/pull/3852)
- add MS-AMP deprecation warnings by
[@&#8203;neha222222](https://redirect.github.com/neha222222) in
[#&#8203;3857](https://redirect.github.com/huggingface/accelerate/pull/3857)

##### Performance

Accelerate now imports faster by deferring heavy dependencies, and
torch.compile hooks are disabled lazily.

- Faster import by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3953](https://redirect.github.com/huggingface/accelerate/pull/3953)
- lazy compile disable by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3947](https://redirect.github.com/huggingface/accelerate/pull/3947)
- Disable hook compile by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3888](https://redirect.github.com/huggingface/accelerate/pull/3888)

##### Minor fixes

- Allow non-Tensor values in a batch with dispatch\_batches=True by
[@&#8203;tomaarsen](https://redirect.github.com/tomaarsen) in
[#&#8203;3850](https://redirect.github.com/huggingface/accelerate/pull/3850)
- fix module and optimizer parameter mismatch before prepare\_tp\_ by
[@&#8203;naomili0924](https://redirect.github.com/naomili0924) in
[#&#8203;3845](https://redirect.github.com/huggingface/accelerate/pull/3845)
- Fix KeyError in extract\_model\_from\_parallel for partial
torch.compile by
[@&#8203;amanzoni1](https://redirect.github.com/amanzoni1) in
[#&#8203;3881](https://redirect.github.com/huggingface/accelerate/pull/3881)
- Fix hf\_device\_map device index comparison in prepare\_model by
[@&#8203;rezaqorbani](https://redirect.github.com/rezaqorbani) in
[#&#8203;3895](https://redirect.github.com/huggingface/accelerate/pull/3895)
- Fix StatefulDataLoader KeyError with num\_workers > 0 by
[@&#8203;veeceey](https://redirect.github.com/veeceey) in
[#&#8203;3931](https://redirect.github.com/huggingface/accelerate/pull/3931)
- Fix stateful dataloader DDP by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3952](https://redirect.github.com/huggingface/accelerate/pull/3952)
- Fix: Remove duplicate W\&B initialization in offline mode by
[@&#8203;shantanugupta2004](https://redirect.github.com/shantanugupta2004)
in
[#&#8203;3886](https://redirect.github.com/huggingface/accelerate/pull/3886)
- Avoid using nvidia-smi on a CPU-only Colab instance by
[@&#8203;FlorianVal](https://redirect.github.com/FlorianVal) in
[#&#8203;3872](https://redirect.github.com/huggingface/accelerate/pull/3872)
- Fix logging logic when in\_order is set to True by
[@&#8203;yuxinyuan](https://redirect.github.com/yuxinyuan) in
[#&#8203;3280](https://redirect.github.com/huggingface/accelerate/pull/3280)
- Fix cpu offload check by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3946](https://redirect.github.com/huggingface/accelerate/pull/3946)
- fix bug when both cpu\_ram\_efficient\_loading and cpu\_offload are
enabled by [@&#8203;kaixuanliu](https://redirect.github.com/kaixuanliu)
in
[#&#8203;3910](https://redirect.github.com/huggingface/accelerate/pull/3910)
- Fix async compatibility across python versions by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3901](https://redirect.github.com/huggingface/accelerate/pull/3901)
- fix tp only bug by
[@&#8203;sywangyi](https://redirect.github.com/sywangyi) in
[#&#8203;3908](https://redirect.github.com/huggingface/accelerate/pull/3908)
- fix parallelism\_config None error by
[@&#8203;jp1924](https://redirect.github.com/jp1924) in
[#&#8203;3927](https://redirect.github.com/huggingface/accelerate/pull/3927)
- Np parall fix by
[@&#8203;sywangyi](https://redirect.github.com/sywangyi) in
[#&#8203;3900](https://redirect.github.com/huggingface/accelerate/pull/3900)
- change the default value of fsdp\_min\_num\_params to int by
[@&#8203;CodeMan62](https://redirect.github.com/CodeMan62) in
[#&#8203;3902](https://redirect.github.com/huggingface/accelerate/pull/3902)
- Fix mutable default in Megatron init and IndexError on empty
ModuleList by
[@&#8203;jashshah999](https://redirect.github.com/jashshah999) in
[#&#8203;3944](https://redirect.github.com/huggingface/accelerate/pull/3944)
- Prepare TP fix by
[@&#8203;michaelbenayoun](https://redirect.github.com/michaelbenayoun)
in
[#&#8203;3945](https://redirect.github.com/huggingface/accelerate/pull/3945)
- feat: added fine tuning example focused on TPUs by
[@&#8203;tengomucho](https://redirect.github.com/tengomucho) in
[#&#8203;3847](https://redirect.github.com/huggingface/accelerate/pull/3847)
- Remove 8bit force hook for bnb by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3907](https://redirect.github.com/huggingface/accelerate/pull/3907)
- docs: flag MS-AMP as deprecated in low-precision training guides by
[@&#8203;ManasVardhan](https://redirect.github.com/ManasVardhan) in
[#&#8203;3929](https://redirect.github.com/huggingface/accelerate/pull/3929)
- fix: correct typo 'guarentee' to 'guarantee' by
[@&#8203;thecaptain789](https://redirect.github.com/thecaptain789) in
[#&#8203;3922](https://redirect.github.com/huggingface/accelerate/pull/3922)
- Updating support of Megatron-LM by
[@&#8203;pengdurice](https://redirect.github.com/pengdurice) in
[#&#8203;3842](https://redirect.github.com/huggingface/accelerate/pull/3842)
- Update support of Megatron-LM PR 2 by
[@&#8203;pengdurice](https://redirect.github.com/pengdurice) in
[#&#8203;3887](https://redirect.github.com/huggingface/accelerate/pull/3887)
- Fix RNG state setting for HPU by
[@&#8203;michaelbenayoun](https://redirect.github.com/michaelbenayoun)
in
[#&#8203;3936](https://redirect.github.com/huggingface/accelerate/pull/3936)
- fix: load the HPU RNG state by
[@&#8203;michaelbenayoun](https://redirect.github.com/michaelbenayoun)
in
[#&#8203;3937](https://redirect.github.com/huggingface/accelerate/pull/3937)

### [`v1.12.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.12.0): Deepspeed Ulysses/ALST

[Compare Source](https://redirect.github.com/huggingface/accelerate/compare/v1.11.0...v1.12.0)

#### Deepspeed Ulysses/ALST integration

Deepspeed Ulysses/ALST is an efficient way of training on long sequences
by employing sequence parallelism and attention head parallelism. You
can learn more about this technology in this paper
<https://arxiv.org/abs/2506.13996> or this deepspeed tutorial
<https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/>.

<img width="2368" height="1250" alt="0d8bd9e0"
src="https://github.com/user-attachments/assets/b94e90c9-4368-4711-ad57-58de3c714ebc"
/>

To enable Deepspeed Ulysses, you first need to create a
`ParallelismConfig` and set the `sp`-related args:

```python
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(...),
)
```

Then, you need to make sure to compute the correct loss, as described in
our
[docs](https://huggingface.co/docs/accelerate/main/en/concept_guides/sequence_parallelism):

```python
        ...
        losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
        good_tokens = (shift_labels != -100).view(-1).sum()
        good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
        total_loss = sum(
            losses_per_rank[rank] * good_tokens_per_rank[rank]
            for rank in range(sp_world_size)
            if good_tokens_per_rank[rank] > 0
        )
        total_good_tokens = sum(good_tokens_per_rank)
        loss = total_loss / max(total_good_tokens, 1)
```

Thanks [@&#8203;S1ro1](https://redirect.github.com/S1ro1) for starting
this work and [@&#8203;stas00](https://redirect.github.com/stas00)
for finishing it. Also thanks
[@&#8203;kashif](https://redirect.github.com/kashif) for adding docs and
reviewing/testing this PR!

This feature will also be available in HF Trainer thanks to this PR
from [@&#8203;stas00](https://redirect.github.com/stas00):
[huggingface/transformers#41832](https://redirect.github.com/huggingface/transformers/pull/41832)

#### Minor changes

- Remove warning for `cpu_ram_efficient_loading` by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3816](https://redirect.github.com/huggingface/accelerate/pull/3816)
- update typo in bnb quantisation 4bit flag docstring by
[@&#8203;hbraith](https://redirect.github.com/hbraith) in
[#&#8203;3828](https://redirect.github.com/huggingface/accelerate/pull/3828)
- ArXiv -> HF Papers by
[@&#8203;qgallouedec](https://redirect.github.com/qgallouedec) in
[#&#8203;3834](https://redirect.github.com/huggingface/accelerate/pull/3834)
- Fix typo in broadcast\_object\_list docstring by
[@&#8203;wsntxxn](https://redirect.github.com/wsntxxn) in
[#&#8203;3823](https://redirect.github.com/huggingface/accelerate/pull/3823)
- \[Bug] Update torch.optim.Optimizer parameter states after tensor
parallelism by
[@&#8203;naomili0924](https://redirect.github.com/naomili0924) in
[#&#8203;3835](https://redirect.github.com/huggingface/accelerate/pull/3835)
- use self hosted runner by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3841](https://redirect.github.com/huggingface/accelerate/pull/3841)
- device type helper by
[@&#8203;kashif](https://redirect.github.com/kashif) in
[#&#8203;3843](https://redirect.github.com/huggingface/accelerate/pull/3843)

#### New Contributors

- [@&#8203;hbraith](https://redirect.github.com/hbraith) made their
first contribution in
[#&#8203;3828](https://redirect.github.com/huggingface/accelerate/pull/3828)
- [@&#8203;wsntxxn](https://redirect.github.com/wsntxxn) made their
first contribution in
[#&#8203;3823](https://redirect.github.com/huggingface/accelerate/pull/3823)
- [@&#8203;naomili0924](https://redirect.github.com/naomili0924) made
their first contribution in
[#&#8203;3835](https://redirect.github.com/huggingface/accelerate/pull/3835)

**Full Changelog**:
<https://github.com/huggingface/accelerate/compare/v1.11.0...v1.12.0>

### [`v1.11.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.11.0): TE MXFP8, FP16/BF16 with MPS, Python 3.10

[Compare
Source](https://redirect.github.com/huggingface/accelerate/compare/v1.10.1...v1.11.0)

#### TE MXFP8 support

We've added support for MXFP8 in our TransformerEngine integration. To
use it, set `use_mxfp8_block_scaling` in `fp8_config`. See the
NVIDIA docs
[here](https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling).

- Add support for TE MXFP8 recipe in accelerate by
[@&#8203;pstjohn](https://redirect.github.com/pstjohn) in
[#&#8203;3688](https://redirect.github.com/huggingface/accelerate/pull/3688)

#### FP16/BF16 Training for MPS devices

BF16 and FP16 support for MPS devices is finally here. You can now pass
`mixed_precision="fp16"` or `mixed_precision="bf16"` when training on a
Mac (`fp16` requires torch 2.8 and `bf16` requires torch 2.6).

- Add bf16/fp16 support for amp with mps device by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3373](https://redirect.github.com/huggingface/accelerate/pull/3373)
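
As a rough illustration of the version requirements stated above, the following sketch maps a torch version string to the mixed-precision modes usable on MPS. The helper name `supported_mps_precisions` is hypothetical and not part of Accelerate; only the 2.6 and 2.8 thresholds come from the release note.

```python
def supported_mps_precisions(torch_version: str) -> list[str]:
    """Return the mixed-precision modes usable on MPS for a given torch version."""
    # Only "major.minor" matters here; patch and local suffixes are ignored.
    major, minor = (int(part) for part in torch_version.split(".")[:2])
    modes = []
    if (major, minor) >= (2, 6):
        modes.append("bf16")  # bf16 requires torch 2.6
    if (major, minor) >= (2, 8):
        modes.append("fp16")  # fp16 requires torch 2.8
    return modes


print(supported_mps_precisions("2.7.1"))
# A mode from this list could then be passed as Accelerator(mixed_precision=...).
```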

#### FSDP updates

The following PRs add support for `ignored_params` and
`no_sync()`, respectively, for FSDPv2:

- feat: add ignored\_params support for fsdp2 by
[@&#8203;kmehant](https://redirect.github.com/kmehant) in
[#&#8203;3731](https://redirect.github.com/huggingface/accelerate/pull/3731)
- fix: model.set\_requires\_gradient\_sync(False) should be called to
turn off gradient synchronization in FSDP2 by
[@&#8203;EquationWalker](https://redirect.github.com/EquationWalker) in
[#&#8203;3762](https://redirect.github.com/huggingface/accelerate/pull/3762)

Mixed precision can now be passed as a dtype string via the accelerate CLI
flag or via `fsdp_config` in the accelerate config file:

- feat: allow mixed precision policy as dtype by
[@&#8203;kmehant](https://redirect.github.com/kmehant) in
[#&#8203;3751](https://redirect.github.com/huggingface/accelerate/pull/3751)

#### Nd-parallel updates

Some minor updates concerning nd-parallelism.

- Context Parallelism docs typos fixed by
[@&#8203;sergiopaniego](https://redirect.github.com/sergiopaniego) in
[#&#8203;3761](https://redirect.github.com/huggingface/accelerate/pull/3761)
- Feat: add to\_json by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3743](https://redirect.github.com/huggingface/accelerate/pull/3743)
- make torch\_native\_parallelism examples device agnostic by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3759](https://redirect.github.com/huggingface/accelerate/pull/3759)
- \[ND Parallel] Update examples, cleanup by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3737](https://redirect.github.com/huggingface/accelerate/pull/3737)

#### Bump to Python 3.10

We've dropped support for Python 3.9 as it reached EOL in October.

- Bump to python3.10 + update linter by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3809](https://redirect.github.com/huggingface/accelerate/pull/3809)

#### Lots of minor fixes

- fix: CPU RAM efficient loading for nd or HSDP parallelisms by
[@&#8203;kmehant](https://redirect.github.com/kmehant) in
[#&#8203;3740](https://redirect.github.com/huggingface/accelerate/pull/3740)
- xpu INT64 all\_gather issue fixed in 2.9 by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3756](https://redirect.github.com/huggingface/accelerate/pull/3756)
- Specify device\_ids in torch.distributed.barrier for PartialState by
[@&#8203;qgallouedec](https://redirect.github.com/qgallouedec) in
[#&#8203;3744](https://redirect.github.com/huggingface/accelerate/pull/3744)
- fix: specify device for process\_tensor in example usage by
[@&#8203;qgallouedec](https://redirect.github.com/qgallouedec) in
[#&#8203;3755](https://redirect.github.com/huggingface/accelerate/pull/3755)
- Lower complexity of get\_balanced\_memory by adding a set by
[@&#8203;SamuelBarryCS](https://redirect.github.com/SamuelBarryCS) in
[#&#8203;3776](https://redirect.github.com/huggingface/accelerate/pull/3776)
- Fix (skip) cuda cache flush when origin device is `cpu` and offloaded
to `meta` by [@&#8203;Qubitium](https://redirect.github.com/Qubitium) in
[#&#8203;3796](https://redirect.github.com/huggingface/accelerate/pull/3796)
- Fix convert LayerNorm without bias to fp8 by
[@&#8203;mjun0812](https://redirect.github.com/mjun0812) in
[#&#8203;3725](https://redirect.github.com/huggingface/accelerate/pull/3725)
- Add optional typing by
[@&#8203;cyyever](https://redirect.github.com/cyyever) in
[#&#8203;3769](https://redirect.github.com/huggingface/accelerate/pull/3769)
- refactor: Use `with` in Accelerator.autocast() instead of
`__enter__()` and `__exit__()` for more elegant style. by
[@&#8203;EquationWalker](https://redirect.github.com/EquationWalker) in
[#&#8203;3767](https://redirect.github.com/huggingface/accelerate/pull/3767)
- switch XPU ccl backend to torch-builtin xccl in
test\_zero3\_integration by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3773](https://redirect.github.com/huggingface/accelerate/pull/3773)
- fix FSDP2 test case failure on XPU by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3771](https://redirect.github.com/huggingface/accelerate/pull/3771)
- Fix tests by [@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3722](https://redirect.github.com/huggingface/accelerate/pull/3722)
- Protect import for device\_mesh by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3742](https://redirect.github.com/huggingface/accelerate/pull/3742)
- Fix `SWANLAB_MODE` by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3808](https://redirect.github.com/huggingface/accelerate/pull/3808)
- Fix tracking swanlab by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3810](https://redirect.github.com/huggingface/accelerate/pull/3810)
- refactor: nit change for get\_parameters\_from\_modules (code debt) by
[@&#8203;kmehant](https://redirect.github.com/kmehant) in
[#&#8203;3815](https://redirect.github.com/huggingface/accelerate/pull/3815)
- Remove deprecated FindTiedParametersResult by
[@&#8203;cyyever](https://redirect.github.com/cyyever) in
[#&#8203;3786](https://redirect.github.com/huggingface/accelerate/pull/3786)
- remove mlflow from testing by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3783](https://redirect.github.com/huggingface/accelerate/pull/3783)
- enable 2 model hook ut cases on XPU by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3774](https://redirect.github.com/huggingface/accelerate/pull/3774)
- Added Tip for better rendering by
[@&#8203;sergiopaniego](https://redirect.github.com/sergiopaniego) in
[#&#8203;3781](https://redirect.github.com/huggingface/accelerate/pull/3781)
- Fix typos by [@&#8203;cyyever](https://redirect.github.com/cyyever) in
[#&#8203;3753](https://redirect.github.com/huggingface/accelerate/pull/3753)
- fix: torch\_npu import error in some envs by
[@&#8203;yanyongyu](https://redirect.github.com/yanyongyu) in
[#&#8203;3764](https://redirect.github.com/huggingface/accelerate/pull/3764)
- Fix: typo makes tests fail by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3765](https://redirect.github.com/huggingface/accelerate/pull/3765)
- fix Muti node CUDA error: invalid device ordinal
[#&#8203;3775](https://redirect.github.com/huggingface/accelerate/issues/3775)
by
[@&#8203;RicardoDominguez](https://redirect.github.com/RicardoDominguez)
in
[#&#8203;3779](https://redirect.github.com/huggingface/accelerate/pull/3779)
- use reset\_peak\_memory\_stats on xpu by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3772](https://redirect.github.com/huggingface/accelerate/pull/3772)

#### New Contributors

- [@&#8203;mjun0812](https://redirect.github.com/mjun0812) made their
first contribution in
[#&#8203;3725](https://redirect.github.com/huggingface/accelerate/pull/3725)
- [@&#8203;sergiopaniego](https://redirect.github.com/sergiopaniego)
made their first contribution in
[#&#8203;3761](https://redirect.github.com/huggingface/accelerate/pull/3761)
- [@&#8203;EquationWalker](https://redirect.github.com/EquationWalker)
made their first contribution in
[#&#8203;3762](https://redirect.github.com/huggingface/accelerate/pull/3762)
- [@&#8203;yanyongyu](https://redirect.github.com/yanyongyu) made their
first contribution in
[#&#8203;3764](https://redirect.github.com/huggingface/accelerate/pull/3764)
- [@&#8203;RicardoDominguez](https://redirect.github.com/RicardoDominguez)
made their first contribution in
[#&#8203;3779](https://redirect.github.com/huggingface/accelerate/pull/3779)
- [@&#8203;SamuelBarryCS](https://redirect.github.com/SamuelBarryCS)
made their first contribution in
[#&#8203;3776](https://redirect.github.com/huggingface/accelerate/pull/3776)
- [@&#8203;Qubitium](https://redirect.github.com/Qubitium) made their
first contribution in
[#&#8203;3796](https://redirect.github.com/huggingface/accelerate/pull/3796)

**Full Changelog**:
<https://github.com/huggingface/accelerate/compare/v1.10.1...v1.11.0>

### [`v1.10.1`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.10.1): Patchfix

[Compare
Source](https://redirect.github.com/huggingface/accelerate/compare/v1.10.0...v1.10.1)

- Feat: add to\_json by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3743](https://redirect.github.com/huggingface/accelerate/pull/3743)
- Protect import for device\_mesh by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3742](https://redirect.github.com/huggingface/accelerate/pull/3742).

**Full Changelog**:
<https://github.com/huggingface/accelerate/compare/v1.10.0...v1.10.1>

### [`v1.10.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.10.0): N-D Parallelism

[Compare
Source](https://redirect.github.com/huggingface/accelerate/compare/v1.9.0...v1.10.0)

### N-D Parallelism

Training large models across multiple GPUs can be complex, especially
when combining [different parallelism
strategies](https://huggingface.co/spaces/nanotron/ultrascale-playbook)
(e.g., TP, CP, DP). To simplify this process, we've collaborated with
[Axolotl](https://redirect.github.com/axolotl-ai-cloud/axolotl/) to
introduce an easy-to-use integration that allows you to apply any
combination of parallelism strategies directly in your training script.
Just pass a `ParallelismConfig` specifying the size of each parallelism
type—it's that simple.
Learn more about how it works in our latest
[blogpost](https://redirect.github.com/huggingface/blog/pull/3006).

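```python
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig
from transformers import AutoModelForCausalLM

# Size of each parallelism dimension
parallelism_config = ParallelismConfig(
    dp_shard_size=2,
    dp_replicate_size=2,
    cp_size=2,
    tp_size=2,
)
accelerator = Accelerator(
    parallelism_config=parallelism_config,
    # ... any other Accelerator arguments
)
# Load the model onto the device mesh built by the accelerator
model = AutoModelForCausalLM.from_pretrained("your-model-name", device_mesh=accelerator.torch_device_mesh)
model = accelerator.prepare(model)
```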

- Parallelism config + TP + HSDP + BYODM (Bring Your Own Device Mesh) by
[@&#8203;SalmanMohammadi](https://redirect.github.com/SalmanMohammadi)
in
[#&#8203;3682](https://redirect.github.com/huggingface/accelerate/pull/3682)
- Feat: context parallel v2.0 by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3700](https://redirect.github.com/huggingface/accelerate/pull/3700)
- set default submesh\_tp\_size to prevent unset local variable error by
[@&#8203;winglian](https://redirect.github.com/winglian) in
[#&#8203;3687](https://redirect.github.com/huggingface/accelerate/pull/3687)
- Add Parallelism getter property to Accelerator class by
[@&#8203;WoosungMyung](https://redirect.github.com/WoosungMyung) in
[#&#8203;3703](https://redirect.github.com/huggingface/accelerate/pull/3703)
- Fix: prepare works even if nothing except tp specified (rare) by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3707](https://redirect.github.com/huggingface/accelerate/pull/3707)
- Set parallelism\_config in constructor due to Trainer reset of State
by [@&#8203;winglian](https://redirect.github.com/winglian) in
[#&#8203;3713](https://redirect.github.com/huggingface/accelerate/pull/3713)
- Fix: tp size wouldn't read from env by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3716](https://redirect.github.com/huggingface/accelerate/pull/3716)
- Remove `ParallelismConfig` from `PartialState` by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3720](https://redirect.github.com/huggingface/accelerate/pull/3720)
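
When sizing a `ParallelismConfig` like the one above, it helps to check how many processes the combination implies. The sketch below assumes the number of processes must equal the product of all dimension sizes; the helper `required_processes` is hypothetical and only illustrates the arithmetic, it is not Accelerate's validation code.

```python
from math import prod


def required_processes(sizes: dict[str, int]) -> int:
    """Number of processes implied by a set of parallelism dimension sizes."""
    return prod(sizes.values())


# Same sizes as in the ParallelismConfig example of this release.
sizes = {"dp_replicate_size": 2, "dp_shard_size": 2, "cp_size": 2, "tp_size": 2}
needed = required_processes(sizes)
print(f"This configuration needs {needed} processes")

# Hypothetical launch check: fail early if the job was started with a different size.
world_size = 16
if world_size != needed:
    raise ValueError(f"expected {needed} processes, got {world_size}")
```

With all four dimensions set to 2, the example configuration would need 16 GPUs under this assumption.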

### FSDP improvements

We've fixed the ignored modules attribute. With this, it is now possible
to train a PEFT model with MoE layers that contain `q_proj` and `v_proj`
parameters. This is especially important for fine-tuning the `gpt-oss`
model.

- ENH: Allow FSDP ignored modules to be regex by
[@&#8203;BenjaminBossan](https://redirect.github.com/BenjaminBossan) in
[#&#8203;3698](https://redirect.github.com/huggingface/accelerate/pull/3698)
- TST Add test for FSDP ignored\_modules as str by
[@&#8203;BenjaminBossan](https://redirect.github.com/BenjaminBossan) in
[#&#8203;3719](https://redirect.github.com/huggingface/accelerate/pull/3719)
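
To illustrate what selecting ignored modules by regex can look like, here is a minimal sketch using only the standard library. The module names and the pattern are made up for illustration, and full-match semantics is an assumption; check the Accelerate FSDP plugin documentation for the exact matching rule.

```python
import re

# Hypothetical module names, as model.named_modules() might report them.
module_names = [
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.self_attn.v_proj",
    "model.layers.0.mlp.experts",
    "model.layers.0.mlp.router",
    "model.layers.1.mlp.experts",
]

# A single regex selects every expert block instead of listing modules one by one.
pattern = re.compile(r".*\.mlp\.experts")
ignored = [name for name in module_names if pattern.fullmatch(name)]
print(ignored)
```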

### Minor improvements

- feature: CpuOffload pre\_forward don't attempt to move if already on
device by [@&#8203;JoeGaffney](https://redirect.github.com/JoeGaffney)
in
[#&#8203;3695](https://redirect.github.com/huggingface/accelerate/pull/3695)
- Fix: Ensure environment variable values are case-insensitive in
Accelerate by [@&#8203;jp1924](https://redirect.github.com/jp1924) in
[#&#8203;3712](https://redirect.github.com/huggingface/accelerate/pull/3712)
- remove use\_ipex by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3721](https://redirect.github.com/huggingface/accelerate/pull/3721)

### New Contributors

- [@&#8203;SalmanMohammadi](https://redirect.github.com/SalmanMohammadi)
made their first contribution in
[#&#8203;3682](https://redirect.github.com/huggingface/accelerate/pull/3682)
- [@&#8203;WoosungMyung](https://redirect.github.com/WoosungMyung) made
their first contribution in
[#&#8203;3703](https://redirect.github.com/huggingface/accelerate/pull/3703)
- [@&#8203;jp1924](https://redirect.github.com/jp1924) made their first
contribution in
[#&#8203;3712](https://redirect.github.com/huggingface/accelerate/pull/3712)
- [@&#8203;JoeGaffney](https://redirect.github.com/JoeGaffney) made
their first contribution in
[#&#8203;3695](https://redirect.github.com/huggingface/accelerate/pull/3695)

**Full Changelog**:
<https://github.com/huggingface/accelerate/compare/v1.9.0...v1.10.0>

### [`v1.9.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.9.0): Trackio support, Model loading speedup, Minor distributed improvements

[Compare
Source](https://redirect.github.com/huggingface/accelerate/compare/v1.8.1...v1.9.0)

### Trackio tracker support

We've added support for trackio, a lightweight, 💯 free experiment
tracking Python library built on top of 🤗 Datasets and Spaces.

![Screen Recording 2025-06-11 at 5 39
32 PM](https://redirect.github.com/user-attachments/assets/5cf12286-54e7-4119-8a20-88c2cbd37ab6)

Main features are:

- *Local-first* design: dashboard runs locally by default. You can also
host it on Spaces by specifying a `space_id`.
- Persists logs locally (or in a private Hugging Face Dataset)
- Visualize experiments with a Gradio dashboard locally (or on Hugging
Face Spaces)
- Everything here, including hosting on Hugging Face, is **free**!

To use it with accelerate, you need to set `log_with` and initialize the
trackers:

```python
from accelerate import Accelerator

accelerator = Accelerator(log_with="trackio")
config = {"learning_rate": 0.001, "batch_size": 32}

# init_kwargs in order to host the dashboard on spaces
init_kwargs = {"trackio": {"space_id": "hf_username/space_name"}}
accelerator.init_trackers("example_project", config=config, init_kwargs=init_kwargs)
```

Thanks [@&#8203;pcuenca](https://redirect.github.com/pcuenca) for the
integration !

- trackio by [@&#8203;pcuenca](https://redirect.github.com/pcuenca) in
[#&#8203;3669](https://redirect.github.com/huggingface/accelerate/pull/3669)

#### Model loading speedup when relying on `set_module_tensor_to_device`

Setting a tensor while clearing the cache is very slow, so we added a
`clear_cache` option to disable it.
Another small optimization is using `non_blocking` everywhere and
syncing just before returning control to the user. This makes
loading slightly faster.

- Speedup model loading by 4-5x in Diffusers ⚡ by
[@&#8203;a-r-r-o-w](https://redirect.github.com/a-r-r-o-w) in
[#&#8203;3674](https://redirect.github.com/huggingface/accelerate/pull/3674)

#### FSDP, DeepSpeed, FP8 minor improvements

- Add support for e5e2 and default to hybrid when launcher is used by
[@&#8203;IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil)
in
[#&#8203;3640](https://redirect.github.com/huggingface/accelerate/pull/3640)
- Fix FP8 tests, enable FP8 to be used without direct `Accelerator()`
configuring by [@&#8203;pstjohn](https://redirect.github.com/pstjohn) in
[#&#8203;3677](https://redirect.github.com/huggingface/accelerate/pull/3677)
- Bunch of FSDP improvements by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3671](https://redirect.github.com/huggingface/accelerate/pull/3671)
- Fix: properly error when DDP + Dtensor model by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3629](https://redirect.github.com/huggingface/accelerate/pull/3629)
- Fix fsdp2 example typo by
[@&#8203;shimizust](https://redirect.github.com/shimizust) in
[#&#8203;3657](https://redirect.github.com/huggingface/accelerate/pull/3657)
- Added a check in no\_sync() to avoid errors when using deepspeed
zero2/3 by [@&#8203;xliu0105](https://redirect.github.com/xliu0105) in
[#&#8203;3656](https://redirect.github.com/huggingface/accelerate/pull/3656)

#### 🚨🚨🚨 Breaking changes 🚨🚨🚨

`find_executable_batch_size()` no longer halves the batch size after
every OOM. Instead, it multiplies the batch size by 0.9. This should
help users avoid wasting GPU capacity.

- “Stop Halving My Batch!” · Default back-off 0.5 → 0.9 by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3684](https://redirect.github.com/huggingface/accelerate/pull/3684)
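
To get a feel for the difference between the old and new back-off factors, here is a small stand-alone simulation. It assumes the reduced batch size is truncated to an integer after each OOM; the helper `backoff_schedule` is hypothetical and does not call Accelerate.

```python
def backoff_schedule(start: int, factor: float, retries: int) -> list[int]:
    """Batch sizes tried after each successive OOM, truncating to an integer."""
    sizes = [start]
    for _ in range(retries):
        sizes.append(int(sizes[-1] * factor))
    return sizes


old = backoff_schedule(128, 0.5, 4)  # previous behavior: halve on every OOM
new = backoff_schedule(128, 0.9, 4)  # new default: shrink by 10% on every OOM
print("factor 0.5:", old)
print("factor 0.9:", new)
```

If the largest batch that fits is, say, 100, halving lands on 64, while the 0.9 factor stops at 92, at the cost of a few more retries.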

#### What's Changed

- \[typo] shards instead of shard by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3645](https://redirect.github.com/huggingface/accelerate/pull/3645)
- Docs: Fix typos in gradient accumulation guide by
[@&#8203;kilavvy](https://redirect.github.com/kilavvy) in
[#&#8203;3649](https://redirect.github.com/huggingface/accelerate/pull/3649)
- xpu enablement on left cases by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3654](https://redirect.github.com/huggingface/accelerate/pull/3654)
- unpin datasets in examples requirements by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3681](https://redirect.github.com/huggingface/accelerate/pull/3681)
- fix: wandb config not saved in offline mode by
[@&#8203;ved1beta](https://redirect.github.com/ved1beta) in
[#&#8203;3648](https://redirect.github.com/huggingface/accelerate/pull/3648)
- accelerate/data\_loader.py: do not yield if the base\_dataloader is
empty by [@&#8203;0xnightwind](https://redirect.github.com/0xnightwind)
in
[#&#8203;3659](https://redirect.github.com/huggingface/accelerate/pull/3659)
- warn for invalid keys by
[@&#8203;ved1beta](https://redirect.github.com/ved1beta) in
[#&#8203;3613](https://redirect.github.com/huggingface/accelerate/pull/3613)
- Update Gaudi runner image to latest SynapseAI and enable previously
disabled tests by
[@&#8203;IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil)
in
[#&#8203;3653](https://redirect.github.com/huggingface/accelerate/pull/3653)

#### New Contributors

- [@&#8203;kilavvy](https://redirect.github.com/kilavvy) made their
first contribution in
[#&#8203;3649](https://redirect.github.com/huggingface/accelerate/pull/3649)
- [@&#8203;shimizust](https://redirect.github.com/shimizust) made their
first contribution in
[#&#8203;3657](https://redirect.github.com/huggingface/accelerate/pull/3657)
- [@&#8203;xliu0105](https://redirect.github.com/xliu0105) made their
first contribution in
[#&#8203;3656](https://redirect.github.com/huggingface/accelerate/pull/3656)
- [@&#8203;0xnightwind](https://redirect.github.com/0xnightwind) made
their first contribution in
[#&#8203;3659](https://redirect.github.com/huggingface/accelerate/pull/3659)

**Full Changelog**:
<https://github.com/huggingface/accelerate/compare/v1.8.1...v1.9.0>

### [`v1.8.1`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.8.1): Patchfix

[Compare
Source](https://redirect.github.com/huggingface/accelerate/compare/v1.8.0...v1.8.1)

- Add support for e5e2 and default to hybrid when launcher is used by
[@&#8203;IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil)
in
[#&#8203;3640](https://redirect.github.com/huggingface/accelerate/pull/3640)
- shards by [@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3645](https://redirect.github.com/huggingface/accelerate/pull/3645)

**Full Changelog**:
<https://github.com/huggingface/accelerate/compare/v1.8.0...v1.8.1>

### [`v1.8.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.8.0): FSDPv2 + FP8, Regional Compilation for DeepSpeed, Faster Distributed Training on Intel CPUs, ipex.optimize deprecation

[Compare
Source](https://redirect.github.com/huggingface/accelerate/compare/v1.7.0...v1.8.0)

### FSDPv2 refactor + FP8 support

We've simplified how to prepare FSDPv2 models, as there were too many
ways to compose FSDP2 with other features (e.g., FP8, torch.compile,
activation checkpointing, etc.). Although the setup is now more
restrictive, it leads to fewer errors and a more performant user
experience. We’ve also added support for FP8. You can read about the
results
[here](https://redirect.github.com/huggingface/accelerate/tree/main/examples/fsdp2).
Thanks to [@&#8203;S1ro1](https://redirect.github.com/S1ro1) for this
contribution!

- \[FSDP2] Refactor + FP8 by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3585](https://redirect.github.com/huggingface/accelerate/pull/3585)

### Faster Distributed Training on Intel CPUs

We updated the `CCL_WORKER_COUNT` variable and added `KMP` parameters
for Intel CPU users. This significantly improves distributed training
performance (e.g., Tensor Parallelism), with up to a 40% speed-up on
Intel 4th Gen Xeon when training transformer TP models.

- Set ccl and KMP param in simple launch by
[@&#8203;jiqing-feng](https://redirect.github.com/jiqing-feng) in
[#&#8203;3575](https://redirect.github.com/huggingface/accelerate/pull/3575)

### Regional Compilation for DeepSpeed

We added support for regional compilation with the DeepSpeed engine.
DeepSpeed’s `.compile()` modifies models in-place using
`torch.nn.Module.compile(...)`, rather than the out-of-place
`torch.compile(...)`, so we had to account for that. Thanks
[@&#8203;IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil)
for this feature!

- Fix deepspeed regional compilation by
[@&#8203;IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil)
in
[#&#8203;3609](https://redirect.github.com/huggingface/accelerate/pull/3609)

### ipex.optimize deprecation

`ipex.optimize` is being deprecated. Most optimizations have been
upstreamed to PyTorch, and future improvements will land there directly.
For users without PyTorch 2.8, we’ll continue to rely on IPEX for now.

- remove ipex.optimize in accelerate by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3608](https://redirect.github.com/huggingface/accelerate/pull/3608)

### Better XPU Support

We've greatly expanded and stabilized support for Intel XPUs:

- enable fsdp2 benchmark on XPU by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3590](https://redirect.github.com/huggingface/accelerate/pull/3590)
- enable big\_model\_inference on xpu by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3595](https://redirect.github.com/huggingface/accelerate/pull/3595)
- enable test\_load\_checkpoint\_and\_dispatch\_with\_broadcast cases on
XPU by [@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
- enable test\_cli & test\_example cases on XPU by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3578](https://redirect.github.com/huggingface/accelerate/pull/3578)
- enable torchao and pippy test cases on XPU by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3599](https://redirect.github.com/huggingface/accelerate/pull/3599)
- enable regional\_compilation benchmark on xpu by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3592](https://redirect.github.com/huggingface/accelerate/pull/3592)
- fix xpu 8bit value loading by
[@&#8203;jiqing-feng](https://redirect.github.com/jiqing-feng) in
[#&#8203;3623](https://redirect.github.com/huggingface/accelerate/pull/3623)
- add device-agnostic GradScaler by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3588](https://redirect.github.com/huggingface/accelerate/pull/3588)
- add xpu support in TorchTensorParallelPlugin by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3627](https://redirect.github.com/huggingface/accelerate/pull/3627)

### Trackers

We've added support for
[SwanLab](https://redirect.github.com/SwanHubX/SwanLab) as an experiment
tracking backend. Huge thanks to
[@&#8203;ShaohonChen](https://redirect.github.com/ShaohonChen) for this
contribution! We also deferred all tracker initializations to prevent
premature setup of distributed environments.

- Integrate SwanLab for offline/online experiment tracking for
Accelerate by
[@&#8203;ShaohonChen](https://redirect.github.com/ShaohonChen) in
[#&#8203;3605](https://redirect.github.com/huggingface/accelerate/pull/3605)
- Fix: Defer Tracker Initialization to Prevent Premature Distributed
Setup by [@&#8203;yuanjua](https://redirect.github.com/yuanjua) in
[#&#8203;3581](https://redirect.github.com/huggingface/accelerate/pull/3581)

#### What's Changed

- Fix bf16 training with TP by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3610](https://redirect.github.com/huggingface/accelerate/pull/3610)
- better handle FP8 with and without deepspeed by
[@&#8203;IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil)
in
[#&#8203;3611](https://redirect.github.com/huggingface/accelerate/pull/3611)
- Update Gaudi Runners by
[@&#8203;IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil)
in
[#&#8203;3593](https://redirect.github.com/huggingface/accelerate/pull/3593)
- goodbye torch\_ccl by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3580](https://redirect.github.com/huggingface/accelerate/pull/3580)
- Add support for standalone mode when default port is occupied on
single node by
[@&#8203;laitifranz](https://redirect.github.com/laitifranz) in
[#&#8203;3576](https://redirect.github.com/huggingface/accelerate/pull/3576)
- Resolve logger warnings by
[@&#8203;emmanuel-ferdman](https://redirect.github.com/emmanuel-ferdman)
in
[#&#8203;3582](https://redirect.github.com/huggingface/accelerate/pull/3582)
- Add kwargs to optimizer, scheduler and dataloader using function
`accelerator().load_state()` by
[@&#8203;luiz0992](https://redirect.github.com/luiz0992) in
[#&#8203;3540](https://redirect.github.com/huggingface/accelerate/pull/3540)
- \[docs] no hard-coded cuda in the ddp documentation by
[@&#8203;faaany](https://redirect.github.com/faaany) in
[#&#8203;3589](https://redirect.github.com/huggingface/accelerate/pull/3589)
- change to use torch.device by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3594](https://redirect.github.com/huggingface/accelerate/pull/3594)
- Fix: list object has no attribute keys by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3603](https://redirect.github.com/huggingface/accelerate/pull/3603)
- Remove device\_count for TPU launcher to avoid initializing runtime by
[@&#8203;sorgfresser](https://redirect.github.com/sorgfresser) in
[#&#8203;3587](https://redirect.github.com/huggingface/accelerate/pull/3587)
- Fix missing te.LayerNorm in intel\_transformer\_engine by
[@&#8203;IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil)
in
[#&#8203;3619](https://redirect.github.com/huggingface/accelerate/pull/3619)
- Add fp8\_e5m2 support in `dtype_byte_size` by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3625](https://redirect.github.com/huggingface/accelerate/pull/3625)
- \[Deepspeed] deepspeed auto grad accum by
[@&#8203;kashif](https://redirect.github.com/kashif) in
[#&#8203;3630](https://redirect.github.com/huggingface/accelerate/pull/3630)
- Remove hardcoded cuda from fsdpv2 by
[@&#8203;IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil)
in
[#&#8203;3631](https://redirect.github.com/huggingface/accelerate/pull/3631)
- Integrate SwanLab for offline/online experiment tracking for
Accelerate by
[@&#8203;ShaohonChen](https://redirect.github.com/ShaohonChen) in
[#&#8203;3605](https://redirect.github.com/huggingface/accelerate/pull/3605)
- Fix Typos in Documentation and Comments by
[@&#8203;leopardracer](https://redirect.github.com/leopardracer) in
[#&#8203;3621](https://redirect.github.com/huggingface/accelerate/pull/3621)
- feat: use datasets.IterableDataset shard if possible by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3635](https://redirect.github.com/huggingface/accelerate/pull/3635)
- \[DeepSpeed] sync gradient accum steps from deepspeed plugin by
[@&#8203;kashif](https://redirect.github.com/kashif) in
[#&#8203;3632](https://redirect.github.com/huggingface/accelerate/pull/3632)
- Feat: add cpu offload by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3636](https://redirect.github.com/huggingface/accelerate/pull/3636)
- Fix: correct labels for fsdp2 examples by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3637](https://redirect.github.com/huggingface/accelerate/pull/3637)
- fix grad acc deepspeed by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3638](https://redirect.github.com/huggingface/accelerate/pull/3638)

#### New Contributors

- [@&#8203;laitifranz](https://redirect.github.com/laitifranz) made
their first contribution in
[#&#8203;3576](https://redirect.github.com/huggingface/accelerate/pull/3576)
-
[@&#8203;emmanuel-ferdman](https://redirect.github.com/emmanuel-ferdman)
made their first contribution in
[#&#8203;3582](https://redirect.github.com/huggingface/accelerate/pull/3582)
- [@&#8203;yuanjua](https://redirect.github.com/yuanjua) made their
first contribution in
[#&#8203;3581](https://redirect.github.com/huggingface/accelerate/pull/3581)
- [@&#8203;sorgfresser](https://redirect.github.com/sorgfresser) made
their first contribution in
[#&#8203;3587](https://redirect.github.com/huggingface/accelerate/pull/3587)
- [@&#8203;ShaohonChen](https://redirect.github.com/ShaohonChen) made
their first contribution in
[#&#8203;3605](https://redirect.github.com/huggingface/accelerate/pull/3605)
- [@&#8203;leopardracer](https://redirect.github.com/leopardracer) made
their first contribution in
[#&#8203;3621](https://redirect.github.com/huggingface/accelerate/pull/3621)

**Full Changelog**:
<https://github.com/huggingface/accelerate/compare/v1.7.0...v1.8.0>

### [`v1.7.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.7.0): Regional compilation, Layerwise casting hook, FSDPv2 + QLoRA

[Compare
Source](https://redirect.github.com/huggingface/accelerate/compare/v1.6.0...v1.7.0)

### Regional compilation

Instead of compiling the entire model at once, regional compilation
targets repeated blocks (such as decoder layers) first. This allows the
compiler to cache and reuse optimized code for subsequent blocks,
significantly reducing the cold start compilation time typically seen
during the first inference. Thanks
[@&#8203;IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil)
for the feature! You can view the full benchmark
[here](https://redirect.github.com/huggingface/accelerate/tree/main/benchmarks/torch.compile),
and check out our updated [compilation
guide](https://huggingface.co/docs/accelerate/en/usage_guides/compilation)
for more details!


![compilation\_time-1](https://redirect.github.com/user-attachments/assets/38795d12-6ee7-4a10-84c6-d29a0877e36c)

To enable this feature, set `use_regional_compilation=True` in the
`TorchDynamoPlugin` configuration.

```python

from accelerate import Accelerator
from accelerate.utils import TorchDynamoPlugin

# Configure the compilation backend
dynamo_plugin = TorchDynamoPlugin(
    use_regional_compilation=True,
    # ... other parameters
)

# Initialize accelerator with the plugin
accelerator = Accelerator(dynamo_plugin=dynamo_plugin)

# This will apply compile_regions to your model
model = accelerator.prepare(model)
```
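
As a rough mental model only, the saving comes from compiling each distinct block type once and reusing the result for every repeat. The sketch below is a toy illustration of that caching idea in plain Python; it is not how `torch.compile` or Accelerate implement regional compilation, and `toy_compile`, `toy_compile_regions`, and the layer layout are all hypothetical names chosen for the example.

```python
from typing import Callable, Dict, List

Block = Callable[[float], float]
stats = {"compilations": 0}  # counts how often the expensive step runs


def toy_compile(block_type: str) -> Block:
    """Stand-in for an expensive compilation of one block type."""
    stats["compilations"] += 1
    return lambda x: x + 1.0  # placeholder for the optimized block


def toy_compile_regions(block_types: List[str]) -> List[Block]:
    """Compile each distinct block type once and reuse it for every repeat."""
    cache: Dict[str, Block] = {}
    compiled: List[Block] = []
    for block_type in block_types:
        if block_type not in cache:
            cache[block_type] = toy_compile(block_type)
        compiled.append(cache[block_type])
    return compiled


# Hypothetical model layout: 32 identical decoder layers plus two unique modules.
layout = ["embedding"] + ["decoder_layer"] * 32 + ["lm_head"]
compiled_blocks = toy_compile_regions(layout)
print(f"{len(compiled_blocks)} blocks, {stats['compilations']} compilations")
```

In this toy layout, 34 blocks need only 3 compilations, which is the intuition behind the reduced cold-start time.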

### Layerwise casting hook

We've introduced a new hook that enables per-layer upcasting and
downcasting (e.g., for Linear layers) during inference. This allows
users to run models with separate storage and compute dtypes, resulting
in memory savings. The concept was first implemented in
[diffusers](https://huggingface.co/docs/diffusers/main/en/optimization/memory#layerwise-casting),
where downcasting models to FP8 proved effective without major quality
degradation. Contributed by
[@&#8203;sayakpaul](https://redirect.github.com/sayakpaul) in
[#&#8203;3427](https://redirect.github.com/huggingface/accelerate/pull/3427)

```python
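import torch
from accelerate.big_modeling import attach_layerwise_casting_hooks

model = ...  # your torch.nn.Module
storage_dtype = torch.float8_e4m3fn  # dtype the weights are kept in
compute_dtype = torch.bfloat16  # dtype the forward pass runs in
attach_layerwise_casting_hooks(
    model,
    storage_dtype=storage_dtype,
    compute_dtype=compute_dtype,
)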
```
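
To get a feel for the memory savings, a back-of-the-envelope calculation helps. The sketch below only compares weight storage for the two dtypes in the example above (1 byte per element for FP8, 2 bytes for bfloat16); the 7B parameter count and the helper name `weight_memory_gib` are assumptions for illustration, and the estimate ignores activations, any modules excluded from casting, and the temporary upcast of the layer currently being executed.

```python
# Bytes per element for the dtypes discussed above.
BYTES_PER_ELEMENT = {"float8_e4m3fn": 1, "bfloat16": 2, "float32": 4}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Memory needed to hold `num_params` weights in `dtype`, in GiB."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 2**30


num_params = 7_000_000_000  # hypothetical 7B-parameter model
stored_fp8 = weight_memory_gib(num_params, "float8_e4m3fn")
stored_bf16 = weight_memory_gib(num_params, "bfloat16")
print(f"FP8 storage:  {stored_fp8:.1f} GiB")
print(f"BF16 storage: {stored_bf16:.1f} GiB")
```

Under these assumptions, storing weights in FP8 roughly halves the weight footprint relative to bfloat16.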

### Better FSDP2 support

This release includes numerous new features and bug fixes. Notably,
we’ve added support for `FULL_STATE_DICT`, a widely used option in FSDP,
now enabling `.save_pretrained()` in transformers to work with FSDP2
wrapped models. QLoRA training is now supported as well, but more testing
is needed. We have also resolved a backend issue related to parameter
offloading to CPU. Additionally, a significant memory spike that
occurred when `cpu_ram_efficient_loading=True` was enabled has been
fixed. Several other minor improvements and fixes are also included—see
the **What’s Changed** section for full details.

- `FULL_STATE_DICT` has been enabled by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3527](https://redirect.github.com/huggingface/accelerate/pull/3527)
- QLoRA support by
[@&#8203;winglian](https://redirect.github.com/winglian) in
[#&#8203;3546](https://redirect.github.com/huggingface/accelerate/pull/3546)
- set backend correctly for CUDA+FSDP2+cpu-offload in
[#&#8203;3574](https://redirect.github.com/huggingface/accelerate/pull/3574)
- memory spike fixed when using `cpu_ram_efficient_loading=True` by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3482](https://redirect.github.com/huggingface/accelerate/pull/3482)

### Better HPU support

We have added
[documentation](https://huggingface.co/docs/accelerate/en/usage_guides/gaudi)
for Intel Gaudi hardware!
Support has been available since v1.5.0 through this
[PR](https://redirect.github.com/huggingface/accelerate/pull/3378).

- Add the HPU into accelerate config by
[@&#8203;yuanwu2017](https://redirect.github.com/yuanwu2017) in
[#&#8203;3495](https://redirect.github.com/huggingface/accelerate/pull/3495)
- Add Gaudi doc by
[@&#8203;regisss](https://redirect.github.com/regisss) in
[#&#8203;3537](https://redirect.github.com/huggingface/accelerate/pull/3537)

### Torch.compile breaking change for `dynamic` argument

We've updated the logic for setting `self.dynamic` to explicitly
preserve `None` rather than defaulting to `False` when the `USE_DYNAMIC`
environment variable is unset. This change aligns the behavior with the
PyTorch documentation for
[torch.compile](https://docs.pytorch.org/stable/generated/torch.compile.html).
Thanks to [@&#8203;yafshar](https://redirect.github.com/yafshar) for
contributing this improvement in
[#&#8203;3567](https://redirect.github.com/huggingface/accelerate/pull/3567).
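
The behavioral difference can be sketched with a small three-state parser. This is a minimal illustration of the idea, not Accelerate's actual code: the helper name `resolve_dynamic`, the accepted truthy strings, and passing the environment as a mapping are all assumptions made for the example. In `torch.compile`, `dynamic=None` lets PyTorch decide automatically, whereas `dynamic=False` disables dynamic shapes outright, so collapsing "unset" into `False` changes behavior.

```python
from typing import Mapping, Optional


def resolve_dynamic(env: Mapping[str, str], name: str = "USE_DYNAMIC") -> Optional[bool]:
    """Return None when the variable is unset, otherwise parse it as a boolean."""
    raw = env.get(name)
    if raw is None:
        return None  # unset: leave the decision to torch.compile
    return raw.strip().lower() in {"1", "true", "yes", "y", "on"}


unset = resolve_dynamic({})
enabled = resolve_dynamic({"USE_DYNAMIC": "true"})
disabled = resolve_dynamic({"USE_DYNAMIC": "0"})
```

If you relied on the old default, set the variable (or the plugin's `dynamic` argument) explicitly to keep the previous behavior.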

#### What's Changed

- use device agnostic torch.OutOfMemoryError from pytorch 2.5.0 by
[@&#8203;yao-matrix](https://redirect.github.com/yao-matrix) in
[#&#8203;3475](https://redirect.github.com/huggingface/accelerate/pull/3475)
- Adds style bot by
[@&#8203;zach-huggingface](https://redirect.github.com/zach-huggingface)
in
[#&#8203;3478](https://redirect.github.com/huggingface/accelerate/pull/3478)
- Fix a tiny typo in `low_precision_training` guide by
[@&#8203;sadra-barikbin](https://redirect.github.com/sadra-barikbin) in
[#&#8203;3488](https://redirect.github.com/huggingface/accelerate/pull/3488)
- Fix check\_tied\_parameters\_in\_config for multimodal models by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3479](https://redirect.github.com/huggingface/accelerate/pull/3479)
- Don't create new param for TorchAO sequential offloading due to weak
BC guarantees by
[@&#8203;a-r-r-o-w](https://redirect.github.com/a-r-r-o-w) in
[#&#8203;3444](https://redirect.github.com/huggingface/accelerate/pull/3444)
- add support for custom function for reducing the batch size by
[@&#8203;winglian](https://redirect.github.com/winglian) in
[#&#8203;3071](https://redirect.github.com/huggingface/accelerate/pull/3071)
- Fix fp8 deepspeed config by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3492](https://redirect.github.com/huggingface/accelerate/pull/3492)
- fix warning error by
[@&#8203;faaany](https://redirect.github.com/faaany) in
[#&#8203;3491](https://redirect.github.com/huggingface/accelerate/pull/3491)
- \[bug] unsafe\_serialization option in "merge-weights" doesn't work by
[@&#8203;cyr0930](https://redirect.github.com/cyr0930) in
[#&#8203;3496](https://redirect.github.com/huggingface/accelerate/pull/3496)
- Add the HPU into accelerate config by
[@&#8203;yuanwu2017](https://redirect.github.com/yuanwu2017) in
[#&#8203;3495](https://redirect.github.com/huggingface/accelerate/pull/3495)
- Use `torch.distributed.checkpoint.state_dict.set_model_state_dict` in
`load_checkpoint_in_model` by
[@&#8203;ringohoffman](https://redirect.github.com/ringohoffman) in
[#&#8203;3432](https://redirect.github.com/huggingface/accelerate/pull/3432)
- nit: needed sanity checks for fsdp2 by
[@&#8203;kmehant](https://redirect.github.com/kmehant) in
[#&#8203;3499](https://redirect.github.com/huggingface/accelerate/pull/3499)
- (Part 1) fix: make TP training compatible with new transformers by
[@&#8203;kmehant](https://redirect.github.com/kmehant) in
[#&#8203;3457](https://redirect.github.com/huggingface/accelerate/pull/3457)
- Fix deepspeed tests by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3503](https://redirect.github.com/huggingface/accelerate/pull/3503)
- Add FP8 runners + tweak building FP8 image by
[@&#8203;zach-huggingface](https://redirect.github.com/zach-huggingface)
in
[#&#8203;3493](https://redirect.github.com/huggingface/accelerate/pull/3493)
- fix: apply torchfix to set `weights_only=True` by
[@&#8203;bzhong-solink](https://redirect.github.com/bzhong-solink) in
[#&#8203;3497](https://redirect.github.com/huggingface/accelerate/pull/3497)
- Fix: require transformers version for tp tests by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3504](https://redirect.github.com/huggingface/accelerate/pull/3504)
- Remove deprecated PyTorch/XLA APIs by
[@&#8203;zpcore](https://redirect.github.com/zpcore) in
[#&#8203;3484](https://redirect.github.com/huggingface/accelerate/pull/3484)
- Fix cache issue by upgrading github actions version by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3513](https://redirect.github.com/huggingface/accelerate/pull/3513)
- \[Feat] Layerwise casting hook by
[@&#8203;sayakpaul](https://redirect.github.com/sayakpaul) in
[#&#8203;3427](https://redirect.github.com/huggingface/accelerate/pull/3427)
- Add torchao to FP8 error message by
[@&#8203;jphme](https://redirect.github.com/jphme) in
[#&#8203;3514](https://redirect.github.com/huggingface/accelerate/pull/3514)
- Fix unwanted cuda init due to torchao by
[@&#8203;SunMarc](https://redirect.github.com/SunMarc) in
[#&#8203;3530](https://redirect.github.com/huggingface/accelerate/pull/3530)
- Solve link error in internal\_mechanism documentation
([#&#8203;3506](https://redirect.github.com/huggingface/accelerate/issues/3506))
by [@&#8203;alvaro-mazcu](https://redirect.github.com/alvaro-mazcu) in
[#&#8203;3507](https://redirect.github.com/huggingface/accelerate/pull/3507)
- \[FSDP2] Enable FULL\_STATE\_DICT by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3527](https://redirect.github.com/huggingface/accelerate/pull/3527)
- \[FSDP2] Fix memory spike with `cpu_ram_efficient_loading=True` by
[@&#8203;S1ro1](https://redirect.github.com/S1ro1) in
[#&#8203;3482](https://redirect.github.com/huggingface/accelerate/pull/3482)
- \[FSDP2] Issues in Wrap Policy and Mixed Precision by
[@&#8203;jhliu17](https://redirect.github.com/jhliu17) in
[#&#8203;3528](https://redire

> ✂ **Note**
> 
> PR body was truncated to here.


</details>

---

### Configuration

📅 **Schedule**: (UTC)

- Branch creation
  - At any time (no schedule defined)
- Automerge
  - At any time (no schedule defined)

🚦 **Automerge**: Enabled.

♻ **Rebasing**: Whenever PR is behind base branch, or you tick the
rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about this update
again.

---


This PR has been generated by [Mend
Renovate](https://redirect.github.com/renovatebot/renovate).


Co-authored-by: aar-public-version-bump-bot[bot] <286693160+aar-public-version-bump-bot[bot]@users.noreply.github.com>