HF Trainer: ALST/Ulysses sequence parallelism integration via HF Accelerate by stas00 · Pull Request #41832 · huggingface/transformers

stas00 · 2025-10-23T21:55:53Z

Integrates HF Accelerate's support for ALST/Ulysses sequence parallelism (huggingface/accelerate#3817) into HF Trainer.

TODO:

docs - no idea where? FSDP/CP isn't documented, nor is any other parallelism for that matter.
tests
need to merge Deepspeed Ulysses/ALST integration accelerate#3817
need to wait for a new HF Accelerate release and use its version in the compatibility checks in the code (most likely 1.11.1)
need to wait for a merge of ulysses mpu: additional api deepspeedai/DeepSpeed#7649
need to wait for a new Deepspeed release after above is merged

…lerate Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

stas00 · 2025-10-23T22:26:56Z

I have a hard time finding where I can add the documentation for this new backend. Context parallelism isn't documented anywhere in HF Trainer - how can users discover it? If I'm missing it, could you please point me to where I should extend the documentation? Thank you.

Same story with CP tests - there are none :( so need to figure out how to write some.

stas00 · 2025-10-24T00:32:55Z

OK, the next issue with the existing integration of CP/FSDP. What does the following mean?

 $ sometrainerscript.py --help
 [...]
 --parallelism_config PARALLELISM_CONFIG, --parallelism-config PARALLELISM_CONFIG

This arg's help text tells users absolutely nothing about what value(s) to pass to --parallelism_config. If it's not meant to be set via CLI args and can only be used explicitly from code, perhaps it shouldn't be listed in --help, or the help text should at least say that it has to be set in code.
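To illustrate the point about help text, here is a minimal sketch using plain `argparse`. The wording of `HELP` is hypothetical and only shows how a one-line description would surface in `--help`; it is not the text used by HF Trainer, and whether the option really cannot be set from the CLI is exactly the open question raised above.

```python
import argparse

# Hypothetical help wording; the real text would be up to the maintainers.
HELP = (
    "Parallelism settings (hypothetical wording). This option takes a config "
    "object built in Python code and cannot be given a meaningful value on the "
    "command line."
)


def build_parser() -> argparse.ArgumentParser:
    """Build a tiny parser that exposes the flag together with a help string."""
    parser = argparse.ArgumentParser(prog="sometrainerscript.py")
    parser.add_argument(
        "--parallelism_config",
        "--parallelism-config",
        default=None,
        help=HELP,
    )
    return parser


# Collapse whitespace so the wrapped help output is easy to inspect.
help_text = " ".join(build_parser().format_help().split())
print(help_text)
```

With a `help=` string in place, the `--help` output carries the explanation next to the flag instead of only repeating its name.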

Rocketknight1 · 2025-10-24T10:59:47Z

cc @SunMarc

stas00 · 2025-10-28T02:29:26Z

        return env

-    def get_auto_remove_tmp_dir(self, tmp_dir=None, before=None, after=None):
+    def get_auto_remove_tmp_dir(self, tmp_dir=None, before=None, after=None, return_pathlib_obj=False):


This is a really old version. In the latest incarnation it always returns a Path object, but to keep BC I added a new flag here instead. The tests are less clunky that way.

The latest version is here: https://github.com/stas00/ml-engineering/blob/master/testing/testing_utils.py

If wanted you could switch to the latest version instead and adapt tests to simplify.

cc @ydshieh

it's much better for it to always return a pathlib.Path object but you'd need to tweak a few tests which use this API.
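A minimal sketch of the backward-compatible flag idea, assuming a simplified stand-in helper: `make_tmp_dir` is hypothetical and is not the real `get_auto_remove_tmp_dir`, which also handles `tmp_dir`, `before` and `after`. Only the `return_pathlib_obj` switch is mirrored here.

```python
import shutil
import tempfile
from pathlib import Path
from typing import Union


def make_tmp_dir(return_pathlib_obj: bool = False) -> Union[str, Path]:
    """Hypothetical stand-in: create a temp dir and return it as str or Path."""
    path = Path(tempfile.mkdtemp())
    # Existing callers keep getting a str; new callers opt in to a Path.
    return path if return_pathlib_obj else str(path)


legacy = make_tmp_dir()                         # old behaviour: str
modern = make_tmp_dir(return_pathlib_obj=True)  # new behaviour: pathlib.Path
print(type(legacy).__name__, type(modern).__name__)

# Clean up the directories created for the demonstration.
shutil.rmtree(legacy)
shutil.rmtree(modern)
```

Defaulting the flag to `False` leaves existing tests untouched, which is the BC trade-off described above; always returning a `Path` would be simpler but requires adapting those callers.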

stas00 · 2025-10-28T02:41:10Z

@SunMarc, this is ready for a review. Tests fail because they need the accelerate PR huggingface/accelerate#3817

I just didn't know where to update docs since parallelism doesn't seem to be documented here at all. Please correct me if I'm wrong.

Thanks to @kashif with the test.

SunMarc

Thanks for this clean integration ! Left a couple of comments. It would be great @kashif @qgallouedec if you can have a look at this PR so that we can also make it compatible with TRL.

SunMarc · 2025-11-04T11:11:35Z

        return env

-    def get_auto_remove_tmp_dir(self, tmp_dir=None, before=None, after=None):
+    def get_auto_remove_tmp_dir(self, tmp_dir=None, before=None, after=None, return_pathlib_obj=False):


cc @ydshieh

SunMarc · 2025-11-04T12:36:42Z

+                losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
+                # special dealing with SFT that has prompt tokens that aren't used in loss computation
+                good_tokens = (shift_labels != -100).view(-1).sum()
+                good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
+                total_loss = sum(losses_per_rank[rank] * good_tokens_per_rank[rank] for rank in range(sp_world_size))
+                total_good_tokens = sum(good_tokens_per_rank)
+                loss = total_loss / max(total_good_tokens, 1)


We probably don't need to do this if num_items_in_batch is computed and passed to unwrapped_model.loss_function. num_items_in_batch was introduced to fix gradient accumulation (https://unsloth.ai/blog/gradient). num_items_in_batch is basically total_good_tokens if grad_acc = 1.

sorry, what is not needed?

This code block is there because we need to compute the correct loss across SP ranks. If you just average the per-rank losses, the result will be incorrect in the case of -100 masked tokens (SFT), since each rank is likely to process a different number of unmasked tokens (this is not DP averaging).

Unless what you mean is that we don't need to calculate total_good_tokens since num_items_in_batch is already that, but the rest of the code remains - did I understand you correctly?

If you pass num_items_in_batch to loss_function, it will sum the loss and then divide it by num_items_in_batch directly. This way I think we don't actually need to recalculate total_loss from the averaged losses and good_tokens_per_rank. Maybe I'm wrong, so please correct me! But I think this might solve the grad acc issue. In any case, we will keep the current code, as not all models accept num_items_in_batch when calculating the loss.

total_loss = sum(losses_per_rank[rank] * good_tokens_per_rank[rank] for rank in range(sp_world_size))

    def ForCausalLMLoss(
        logits,
        labels,
        vocab_size: int,
        num_items_in_batch: Optional[int] = None,
        ignore_index: int = -100,
        shift_labels: Optional[torch.Tensor] = None,
        **kwargs,
    ) -> torch.Tensor:
        # Upcast to float if we need to compute the loss to avoid potential precision issues
        logits = logits.float()

        if shift_labels is None:
            # Shift so that tokens < n predict n
            labels = nn.functional.pad(labels, (0, 1), value=ignore_index)
            shift_labels = labels[..., 1:].contiguous()

        # Flatten the tokens
        logits = logits.view(-1, vocab_size)
        shift_labels = shift_labels.view(-1)
        # Enable model parallelism
        shift_labels = shift_labels.to(logits.device)
        loss = fixed_cross_entropy(logits, shift_labels, num_items_in_batch, ignore_index, **kwargs)
        return loss


    def fixed_cross_entropy(
        source: torch.Tensor,
        target: torch.Tensor,
        num_items_in_batch: Optional[int] = None,
        ignore_index: int = -100,
        **kwargs,
    ) -> torch.Tensor:
        reduction = "sum" if num_items_in_batch is not None else "mean"
        loss = nn.functional.cross_entropy(source, target, ignore_index=ignore_index, reduction=reduction)
        if reduction == "sum":
            loss = loss / num_items_in_batch
        return loss

If you pass num_items_in_batch you indeed don't need to do the local loss calculation, since the loss function already does that. But we still need to compute a loss that is combined across ranks.

Here is an example. Let's take a 2k-token sample that is SP-split across 2 ranks using SFT:

SP rank0 - 900 masked and 100 non-masked tokens (a long initial prompt that is -100 masked out)

SP rank1 - 100 masked and 900 non-masked tokens

So each rank produces the correct loss if we use num_items_in_batch, but how do you combine the losses of the 2 ranks? A straight average will give a very skewed result, because rank0's loss covers 9x fewer non-masked tokens than rank1's.
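The skew can be sketched in plain Python, without any distributed setup. The per-rank mean losses below (2.0 and 1.0) are made-up numbers for illustration; the token counts are the 100/900 split from the example above, and the weighting mirrors the `total_loss / total_good_tokens` formula from the diff.

```python
def naive_mean(losses):
    """Plain average of per-rank mean losses (ignores token counts)."""
    return sum(losses) / len(losses)


def token_weighted_mean(losses, good_tokens):
    """Weight each rank's mean loss by its number of non-masked tokens."""
    total_loss = sum(loss * n for loss, n in zip(losses, good_tokens))
    return total_loss / max(sum(good_tokens), 1)


# Made-up per-rank mean losses; token counts follow the 100/900 example.
rank_losses = [2.0, 1.0]
rank_good_tokens = [100, 900]

naive = naive_mean(rank_losses)                              # (2.0 + 1.0) / 2
weighted = token_weighted_mean(rank_losses, rank_good_tokens)  # 1100 / 1000
print(naive, weighted)
```

The naive average treats rank0's 100 tokens as equal in weight to rank1's 900, so it overstates the loss; the weighted form recovers the mean over all 1000 non-masked tokens.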

Let's take it to a more telling example:

SP rank0 - 1000 masked and 0 non-masked tokens (a long initial prompt that is masked out)

SP rank1 - 0 masked and 1000 non-masked tokens

Here rank0 can't contribute anything to the total loss. A normal averaging of the 2 losses would be completely broken: you'd be averaging in undefined behavior, since the loss function will return a NaN or None for rank0.
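A small plain-Python sketch of this degenerate case, assuming rank0's mean loss comes back as NaN (a mean over zero tokens is 0/0). It also shows why skipping ranks with zero good tokens matters, as the `if good_tokens_per_rank[rank] > 0` guard does in the Accelerate v1.12.0 release-notes snippet: NaN times zero is still NaN. The loss value 1.0 for rank1 is made up.

```python
import math

# rank0 saw only masked tokens, so its mean loss is NaN; rank1's value is made up.
rank_losses = [math.nan, 1.0]
rank_good_tokens = [0, 1000]

total_good_tokens = max(sum(rank_good_tokens), 1)

# Without a guard, NaN * 0 is still NaN and poisons the whole sum.
unguarded = sum(l * n for l, n in zip(rank_losses, rank_good_tokens)) / total_good_tokens

# Skipping ranks that contributed no tokens keeps the result finite.
guarded = sum(
    l * n for l, n in zip(rank_losses, rank_good_tokens) if n > 0
) / total_good_tokens

print(unguarded, guarded)
```

The guarded result equals rank1's loss, since rank1 holds every non-masked token in this example.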

So each rank produces the correct loss if we use num_items_in_batch, but how do you combine the losses of the 2 ranks? A straight average will give a very skewed result, because rank0's loss covers 9x fewer non-masked tokens than rank1's.

Both losses have the same denominator, num_items_in_batch, and the value of each loss already takes into account the number of non-masked tokens, since we use reduction = "sum". So we just sum them to get the final loss. In your first example, num_items_in_batch will be equal to 1000. For rank0, the loss will be (l1 + ... + l100) / 1000 and for rank1, it will be (l1 + ... + l900) / 1000.
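The arithmetic of this argument can be checked with a plain-Python sketch. It assumes every rank divides its summed token losses by the same global `num_items_in_batch` (1000 here); the per-token loss values are made up. It only covers the arithmetic, not how the per-rank values are gathered across ranks in a differentiable way.

```python
import math

# Made-up per-token losses for the non-masked tokens on each SP rank.
rank0_token_losses = [2.0] * 100
rank1_token_losses = [1.0] * 900

# Global count of non-masked tokens, shared by both ranks.
num_items_in_batch = len(rank0_token_losses) + len(rank1_token_losses)

# reduction="sum" followed by division by the global count, done per rank.
rank0_loss = sum(rank0_token_losses) / num_items_in_batch
rank1_loss = sum(rank1_token_losses) / num_items_in_batch

# Summing the per-rank values gives the mean over all non-masked tokens.
combined = rank0_loss + rank1_loss
global_mean = sum(rank0_token_losses + rank1_token_losses) / num_items_in_batch
print(combined, global_mean)
```

Under that assumption a plain sum across ranks matches the token-weighted mean, so no per-rank reweighting is needed; the weighted formula is only required when each rank returns a local mean.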

I have a feeling we are missing each other. I'm talking about differentiable loss combination across ranks and I think you're talking about the local rank's loss.

Could you please point me to the code in HF Trainer that performs a differentiable loss combination across multiple ranks? I couldn't find any.

kashif · 2025-11-04T13:20:33Z

Thanks @SunMarc I did an initial run with default settings in TRL and these branches, and it all worked nicely (apart from model saving due to not updating deepspeed i think). I will check the config that uses a bespoke compute_loss in the sft trainer, thanks for the heads up!

SunMarc · 2025-11-04T13:22:20Z

@SunMarc, this is ready for a review. Tests fail because they need the accelerate PR huggingface/accelerate#3817

I just didn't know where to update docs since parallelism doesn't seem to be documented here at all. Please correct me if I'm wrong.

Thanks to @kashif with the test.

We indeed do not have docs related to that. @kashif, as you added CP support in Trainer, would you be willing to add some docs around that?
@stas00, I think we can either update deepspeed docs https://huggingface.co/docs/transformers/main/en/deepspeed and/or create a new docs called contextparallel like in accelerate.

HuggingFaceDocBuilderDev · 2025-11-04T13:31:35Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

kashif · 2025-11-04T13:48:03Z

@SunMarc docs on TRL side: huggingface/trl#4420

kashif · 2025-11-04T14:24:42Z

i can look at the docs on the accelerate side as well...

stas00 · 2025-11-04T17:01:13Z

@stas00, I think we can either update deepspeed docs https://huggingface.co/docs/transformers/main/en/deepspeed and/or create a new docs called contextparallel like in accelerate.

Ideally we would have a dedicated doc like you suggested, which could then link into deepspeed for nuances as one way to do that. The key is for the user to quickly understand what's possible, thus a single context parallel entry point doc would be very useful to users.

zhangwj618 · 2025-11-20T02:39:40Z

Is there a mismatch between the docs and the code? docs/source/en/deepspeed.md says:
"By default, when you only configure sp_size, DP is automatically calculated as dp_size = world_size / sp_size."
However, when I run the code with sp_size != world_size, I get this error, unless I specify dp_replicate_size manually.
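For reference, the relationship quoted from the docs can be written as a small helper. `derive_dp_size` is a hypothetical function, not part of Transformers or Accelerate; it only computes the value one would then pass explicitly (for example as `dp_replicate_size`, assuming replicated data parallelism is what the setup calls for).

```python
def derive_dp_size(world_size: int, sp_size: int) -> int:
    """Hypothetical helper: number of data-parallel groups for a given SP size."""
    if sp_size < 1:
        raise ValueError("sp_size must be >= 1")
    if world_size % sp_size != 0:
        raise ValueError(
            f"world_size ({world_size}) must be divisible by sp_size ({sp_size})"
        )
    # dp_size = world_size / sp_size, as stated in the quoted docs sentence.
    return world_size // sp_size


# Example: 8 GPUs with sequence parallelism over 2 ranks leaves 4 DP groups.
dp_size = derive_dp_size(world_size=8, sp_size=2)
print(dp_size)
```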

kashif · 2025-11-20T08:42:40Z

@zhangwj618 the doc is wrong... my bad let me fix it!

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

ArthurZucker · 2025-11-21T13:19:55Z

Thanks a lot @stas00 for your work 🤗

stas00 · 2025-11-21T17:41:39Z

super! Thanks a lot to Marc and Kashif for help with integration and Weijie Zhang for being the first early adopter!

@kashif

…lerate (huggingface#41832)

* HF Trainer: ALST/Ulysses sequence parallelism integration via HF Accelerate
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* make it work + tests
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* cleanup
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* undo
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* normalize
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* always return cp_size
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* cleanup
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* extract code into _deepspeed_cp_compute_loss
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* fix
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* ALST/Ulysses sequence parallelism docs
* typo
* add link to UlyssesSPDataLoaderAdapter
* adapt to renaming to SP
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* improve
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* fix
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* Update docs/source/en/deepspeed.md
  Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
* address comments
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* address comments
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* Update src/transformers/trainer.py
  Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
* address comments
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* address comments
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* Update src/transformers/trainer.py
  Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
* Update src/transformers/trainer.py
  Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
* style
  Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
* Update docs/source/en/deepspeed.md
  Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
* Update docs/source/en/deepspeed.md
  Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
* Account for Sequence Parallelism (SP) dataloader adapter effect
* Update src/transformers/trainer.py
  Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
* Update docs/source/en/deepspeed.md
  Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
* Update docs/source/en/deepspeed.md
  Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
* model_accepts_loss_kwargs to False
* better comment
* Apply suggestion from @kashif
* Apply suggestion from @kashif
* Apply suggestions from code review
* Apply suggestion from @kashif
* Apply suggestion from @kashif
* Apply suggestion from @kashif
* Update src/transformers/trainer.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Update src/transformers/training_args.py
  Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>
* Apply suggestion from @kashif
* Apply suggestion from @kashif

---------

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Stas Bekman <stas.bekman@snowflake.com>
Co-authored-by: Kashif Rasul <kashif.rasul@gmail.com>
Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

> ℹ️ **Note**
>
> This PR body was truncated due to platform limits.

This PR contains the following updates:

| Package | Change | [Age](https://docs.renovatebot.com/merge-confidence/) | [Confidence](https://docs.renovatebot.com/merge-confidence/) |
|---|---|---|---|
| [accelerate](https://redirect.github.com/huggingface/accelerate) | `>=0.34.2,<1` → `>=1.13.0,<2` | ![age](https://developer.mend.io/api/mc/badges/age/pypi/accelerate/1.13.0?slim=true) | ![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/accelerate/0.34.2/1.13.0?slim=true) |

---

### Release Notes

<details>
<summary>huggingface/accelerate (accelerate)</summary>

### [`v1.13.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.13.0): Neuron support, IPEX removal, and distributed training fixes

[Compare Source](https://redirect.github.com/huggingface/accelerate/compare/v1.12.0...v1.13.0)

#### AWS Neuron support

We now have support for AWS Neuron (Trainium/Inferentia) devices. Thanks [@michaelbenayoun](https://redirect.github.com/michaelbenayoun) for adding this.

- Neuron integration by [@michaelbenayoun](https://redirect.github.com/michaelbenayoun) in [#3935](https://redirect.github.com/huggingface/accelerate/pull/3935)

##### XPU Improvements

We've removed IPEX dependency and improved device-agnostic code for XPU.

- using spawn instead of fork for XPU device by [@kaixuanliu](https://redirect.github.com/kaixuanliu) in [#3884](https://redirect.github.com/huggingface/accelerate/pull/3884)
- Remove ipex by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3883](https://redirect.github.com/huggingface/accelerate/pull/3883)
- enhance new codes to XPU, and make them be device agnostic by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3890](https://redirect.github.com/huggingface/accelerate/pull/3890)
- Fix KMP\_AFFINITY incorrectly set for non-CPU training by [@hexfaker](https://redirect.github.com/hexfaker) in [#3912](https://redirect.github.com/huggingface/accelerate/pull/3912)

#### FSDP2 Improvements

We've added a bunch of important fixes for FSDP2 users: upcasting only grad-requiring params, better tied embedding errors, DCP optimizer loading, bf16 optimizer step crash fix, and torch < 2.7.0 compatibility.

- Upcast FSDP2 parameters only if requires\_grad by [@ojh31](https://redirect.github.com/ojh31) in [#3848](https://redirect.github.com/huggingface/accelerate/pull/3848)
- Fix FSDP2 tied embedding errors with targeted ValueError guidance by [@amanzoni1](https://redirect.github.com/amanzoni1) in [#3878](https://redirect.github.com/huggingface/accelerate/pull/3878)
- bug: fsdp cannot load optimizer state using dcp by [@flymin](https://redirect.github.com/flymin) in [#3904](https://redirect.github.com/huggingface/accelerate/pull/3904)
- fix crash in optimizer.step when fsdp2 is enabled and model is bfloat16 by [@sywangyi](https://redirect.github.com/sywangyi) in [#3905](https://redirect.github.com/huggingface/accelerate/pull/3905)
- Fix FSDP2 crash with ignored\_params on torch < 2.7.0 by [@Mr-Neutr0n](https://redirect.github.com/Mr-Neutr0n) in [#3924](https://redirect.github.com/huggingface/accelerate/pull/3924)

#### DeepSpeed Sequence Parallelism

We've added several fixes to the DeepSpeed + Sequence Parallelism integration introduced in v1.12.0, including evaluation support during SP training and proper process group handling.

- \[SP] fix loss computation example by [@kashif](https://redirect.github.com/kashif) in [#3858](https://redirect.github.com/huggingface/accelerate/pull/3858)
- \[SP and CP] error out if both CP and SP enabled by [@kashif](https://redirect.github.com/kashif) in [#3862](https://redirect.github.com/huggingface/accelerate/pull/3862)
- DeepSpeed has its own process group by [@kashif](https://redirect.github.com/kashif) in [#3916](https://redirect.github.com/huggingface/accelerate/pull/3916)
- \[Deepspeed] skip device mesh creation when deepspeed and sp\_size >1 by [@kashif](https://redirect.github.com/kashif) in [#3914](https://redirect.github.com/huggingface/accelerate/pull/3914)
- Enable evaluation during deepspeed Sequence Parallel by [@jp1924](https://redirect.github.com/jp1924) in [#3917](https://redirect.github.com/huggingface/accelerate/pull/3917)

##### FP8

We've enhanced FP8 training. Thanks [@shimizust](https://redirect.github.com/shimizust) for fixing torchao support.

- Fix FP8 torchao default config with padding and FSDP2 all-gather support by [@shimizust](https://redirect.github.com/shimizust) in [#3831](https://redirect.github.com/huggingface/accelerate/pull/3831)
- Fix execution with Transformer Engine by [@ksivaman](https://redirect.github.com/ksivaman) in [#3852](https://redirect.github.com/huggingface/accelerate/pull/3852)
- add MS-AMP deprecation warnings by [@neha222222](https://redirect.github.com/neha222222) in [#3857](https://redirect.github.com/huggingface/accelerate/pull/3857)

##### Performance

Accelerate now imports faster by deferring heavy dependencies, and torch.compile hooks are disabled lazily.

- Faster import by [@SunMarc](https://redirect.github.com/SunMarc) in [#3953](https://redirect.github.com/huggingface/accelerate/pull/3953)
- lazy compile disable by [@SunMarc](https://redirect.github.com/SunMarc) in [#3947](https://redirect.github.com/huggingface/accelerate/pull/3947)
- Disable hook compile by [@SunMarc](https://redirect.github.com/SunMarc) in [#3888](https://redirect.github.com/huggingface/accelerate/pull/3888)

##### Minor fixes

- Allow non-Tensor values in a batch with dispatch\_batches=True by [@tomaarsen](https://redirect.github.com/tomaarsen) in [#3850](https://redirect.github.com/huggingface/accelerate/pull/3850)
- fix module and optimizer parameter mismatch before prepare\_tp\_ by [@naomili0924](https://redirect.github.com/naomili0924) in [#3845](https://redirect.github.com/huggingface/accelerate/pull/3845)
- Fix KeyError in extract\_model\_from\_parallel for partial torch.compile by [@amanzoni1](https://redirect.github.com/amanzoni1) in [#3881](https://redirect.github.com/huggingface/accelerate/pull/3881)
- Fix hf\_device\_map device index comparison in prepare\_model by [@rezaqorbani](https://redirect.github.com/rezaqorbani) in [#3895](https://redirect.github.com/huggingface/accelerate/pull/3895)
- Fix StatefulDataLoader KeyError with num\_workers > 0 by [@veeceey](https://redirect.github.com/veeceey) in [#3931](https://redirect.github.com/huggingface/accelerate/pull/3931)
- Fix stateful dataloader DDP by [@SunMarc](https://redirect.github.com/SunMarc) in [#3952](https://redirect.github.com/huggingface/accelerate/pull/3952)
- Fix: Remove duplicate W\&B initialization in offline mode by [@shantanugupta2004](https://redirect.github.com/shantanugupta2004) in [#3886](https://redirect.github.com/huggingface/accelerate/pull/3886)
- Avoid using nvidia-smi on a CPU-only Colab instance by [@FlorianVal](https://redirect.github.com/FlorianVal) in [#3872](https://redirect.github.com/huggingface/accelerate/pull/3872)
- Fix logging logic when in\_order is set to True by [@yuxinyuan](https://redirect.github.com/yuxinyuan) in [#3280](https://redirect.github.com/huggingface/accelerate/pull/3280)
- Fix cpu offload check by [@SunMarc](https://redirect.github.com/SunMarc) in [#3946](https://redirect.github.com/huggingface/accelerate/pull/3946)
- fix bug when both cpu\_ram\_efficient\_loading and cpu\_offload are enabled by [@kaixuanliu](https://redirect.github.com/kaixuanliu) in [#3910](https://redirect.github.com/huggingface/accelerate/pull/3910)
- Fix async compatibility across python versions by [@SunMarc](https://redirect.github.com/SunMarc) in [#3901](https://redirect.github.com/huggingface/accelerate/pull/3901)
- fix tp only bug by [@sywangyi](https://redirect.github.com/sywangyi) in [#3908](https://redirect.github.com/huggingface/accelerate/pull/3908)
- fix parallelism\_config None error by [@jp1924](https://redirect.github.com/jp1924) in [#3927](https://redirect.github.com/huggingface/accelerate/pull/3927)
- Np parall fix by [@sywangyi](https://redirect.github.com/sywangyi) in [#3900](https://redirect.github.com/huggingface/accelerate/pull/3900)
- change the default value of fsdp\_min\_num\_params to int by [@CodeMan62](https://redirect.github.com/CodeMan62) in [#3902](https://redirect.github.com/huggingface/accelerate/pull/3902)
- Fix mutable default in Megatron init and IndexError on empty ModuleList by [@jashshah999](https://redirect.github.com/jashshah999) in [#3944](https://redirect.github.com/huggingface/accelerate/pull/3944)
- Prepare TP fix by [@michaelbenayoun](https://redirect.github.com/michaelbenayoun) in [#3945](https://redirect.github.com/huggingface/accelerate/pull/3945)
- feat: added fine tuning example focused on TPUs by [@tengomucho](https://redirect.github.com/tengomucho) in [#3847](https://redirect.github.com/huggingface/accelerate/pull/3847)
- Remove 8bit force hook for bnb by [@SunMarc](https://redirect.github.com/SunMarc) in [#3907](https://redirect.github.com/huggingface/accelerate/pull/3907)
- docs: flag MS-AMP as deprecated in low-precision training guides by [@ManasVardhan](https://redirect.github.com/ManasVardhan) in [#3929](https://redirect.github.com/huggingface/accelerate/pull/3929)
- fix: correct typo 'guarentee' to 'guarantee' by [@thecaptain789](https://redirect.github.com/thecaptain789) in [#3922](https://redirect.github.com/huggingface/accelerate/pull/3922)
- Updating support of Megatron-LM by [@pengdurice](https://redirect.github.com/pengdurice) in [#3842](https://redirect.github.com/huggingface/accelerate/pull/3842)
- Update support of Megatron-LM PR 2 by [@pengdurice](https://redirect.github.com/pengdurice) in [#3887](https://redirect.github.com/huggingface/accelerate/pull/3887)
- Fix RNG state setting for HPU by [@michaelbenayoun](https://redirect.github.com/michaelbenayoun) in [#3936](https://redirect.github.com/huggingface/accelerate/pull/3936)
- fix: load the HPU RNG state by [@michaelbenayoun](https://redirect.github.com/michaelbenayoun) in [#3937](https://redirect.github.com/huggingface/accelerate/pull/3937)

### [`v1.12.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.12.0): Deepspeed Ulysses/ALST

[Compare Source](https://redirect.github.com/huggingface/accelerate/compare/v1.11.0...v1.12.0)

#### Deepspeed Ulysses/ALST integration

Deepspeed Ulysses/ALST is an efficient way of training on long sequences by employing sequence parallelism and attention head parallelism. You can learn more about this technology in this paper <https://arxiv.org/abs/2506.13996> or this deepspeed tutorial <https://www.deepspeed.ai/tutorials/ulysses-alst-sequence-parallelism/>.

<img width="2368" height="1250" alt="0d8bd9e0" src="https://github.com/user-attachments/assets/b94e90c9-4368-4711-ad57-58de3c714ebc" />

To enable Deepspeed Ulysses, you first need to create `ParallelismConfig` and setting `sp` related args:

```python
parallelism_config = ParallelismConfig(
    sp_backend="deepspeed",
    sp_size=2,
    sp_handler=DeepSpeedSequenceParallelConfig(...),
)
```

Then, you need to make sure to compute the correct loss as described on our [docs](https://huggingface.co/docs/accelerate/main/en/concept_guides/sequence_parallelism)

```python
...
losses_per_rank = torch.distributed.nn.functional.all_gather(loss, group=sp_group)
good_tokens = (shift_labels != -100).view(-1).sum()
good_tokens_per_rank = torch.distributed.nn.functional.all_gather(good_tokens, group=sp_group)
total_loss = sum(
    losses_per_rank[rank] * good_tokens_per_rank[rank]
    for rank in range(sp_world_size)
    if good_tokens_per_rank[rank] > 0
)
total_good_tokens = sum(good_tokens_per_rank)
loss = total_loss / max(total_good_tokens, 1)
```

Thanks [@S1ro1](https://redirect.github.com/S1ro1) for starting this work and for [@stas00](https://redirect.github.com/stas00) for finishing this work. Also thanks [@kashif](https://redirect.github.com/kashif) for adding docs and reviewing/testing this PR !

This feature will also be available in HF Trainer thanks for this PR from [@stas00](https://redirect.github.com/stas00): [huggingface/transformers#41832](https://redirect.github.com/huggingface/transformers/pull/41832)

#### Minor changes

- Remove warning for `cpu_ram_efficient_loading` by [@SunMarc](https://redirect.github.com/SunMarc) in [#3816](https://redirect.github.com/huggingface/accelerate/pull/3816)
- update typo in bnb quantisation 4bit flag docstring by [@hbraith](https://redirect.github.com/hbraith) in [#3828](https://redirect.github.com/huggingface/accelerate/pull/3828)
- ArXiv -> HF Papers by [@qgallouedec](https://redirect.github.com/qgallouedec) in [#3834](https://redirect.github.com/huggingface/accelerate/pull/3834)
- Fix typo in broadcast\_object\_list docstring by [@wsntxxn](https://redirect.github.com/wsntxxn) in [#3823](https://redirect.github.com/huggingface/accelerate/pull/3823)
- \[Bug] Update torch.optim.Optimizer parameter states after tensor parallelism by [@naomili0924](https://redirect.github.com/naomili0924) in [#3835](https://redirect.github.com/huggingface/accelerate/pull/3835)
- use self hosted runner by [@SunMarc](https://redirect.github.com/SunMarc) in [#3841](https://redirect.github.com/huggingface/accelerate/pull/3841)
- device type helper by [@kashif](https://redirect.github.com/kashif) in [#3843](https://redirect.github.com/huggingface/accelerate/pull/3843)

#### New Contributors

- [@hbraith](https://redirect.github.com/hbraith) made their first contribution in [#3828](https://redirect.github.com/huggingface/accelerate/pull/3828)
- [@wsntxxn](https://redirect.github.com/wsntxxn) made their first contribution in [#3823](https://redirect.github.com/huggingface/accelerate/pull/3823)
- [@naomili0924](https://redirect.github.com/naomili0924) made their first contribution in [#3835](https://redirect.github.com/huggingface/accelerate/pull/3835)

**Full Changelog**: <https://github.com/huggingface/accelerate/compare/v1.11.0...v1.12.0>

### [`v1.11.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.11.0): TE MXFP8, FP16/BF16 with MPS, Python 3.10

[Compare Source](https://redirect.github.com/huggingface/accelerate/compare/v1.10.1...v1.11.0)

#### TE MXFP8 support

We've added support for MXFP8 in our TransformerEngine integration. To use that, you need to set `use_mxfp8_block_scaling` in `fp8_config`. See nvidia docs \[here]. (<https://docs.nvidia.com/deeplearning/transformer-engine/user-guide/examples/fp8_primer.html#MXFP8-and-block-scaling>)

- Add support for TE MXFP8 recipe in accelerate by [@pstjohn](https://redirect.github.com/pstjohn) in [#3688](https://redirect.github.com/huggingface/accelerate/pull/3688)

#### FP16/BF16 Training for MPS devices

BF16 and FP16 support for MPS devices is finally here. You can now pass `mixed_precision = "fp16" or "bf16"` when training on a mac (`fp16` requires torch 2.8 and `bf16` requires torch 2.6)

- Add bf16/fp16 support for amp with mps device by [@SunMarc](https://redirect.github.com/SunMarc) in [#3373](https://redirect.github.com/huggingface/accelerate/pull/3373)

#### FSDP updates

The following PRs add respectively support to `ignored_params` and `no_sync()` for FSDPv2:

- feat: add ignored\_params support for fsdp2 by [@kmehant](https://redirect.github.com/kmehant) in [#3731](https://redirect.github.com/huggingface/accelerate/pull/3731)
- fix: model.set\_requires\_gradient\_sync(False) should be called to turn off gradient synchronization in FSDP2 by [@EquationWalker](https://redirect.github.com/EquationWalker) in [#3762](https://redirect.github.com/huggingface/accelerate/pull/3762)

Mixed precision can now be passed as a dtype string from accelerate cli flag or `fsdp_config` in accelerate config file:

- feat: allow mixed precision policy as dtype by [@kmehant](https://redirect.github.com/kmehant) in [#3751](https://redirect.github.com/huggingface/accelerate/pull/3751)

#### Nd-parallel updates

Some minor updates concerning nd-parallelism.

- Context Parallelism docs typos fixed by [@sergiopaniego](https://redirect.github.com/sergiopaniego) in [#3761](https://redirect.github.com/huggingface/accelerate/pull/3761)
- Feat: add to\_json by [@S1ro1](https://redirect.github.com/S1ro1) in [#3743](https://redirect.github.com/huggingface/accelerate/pull/3743)
- make torch\_native\_parallelism examples device agnostic by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3759](https://redirect.github.com/huggingface/accelerate/pull/3759)
- \[ND Parallel] Update examples, cleanup by [@S1ro1](https://redirect.github.com/S1ro1) in [#3737](https://redirect.github.com/huggingface/accelerate/pull/3737)

#### Bump to Python 3.10

We've dropped support for python 3.9 as it reached EOL in October.

- Bump to python3.10 + update linter by [@SunMarc](https://redirect.github.com/SunMarc) in [#3809](https://redirect.github.com/huggingface/accelerate/pull/3809)

##### Lots of minor fixes:

- fix: CPU RAM efficient loading for nd or HSDP parallelisms by [@kmehant](https://redirect.github.com/kmehant) in [#3740](https://redirect.github.com/huggingface/accelerate/pull/3740)
- xpu INT64 all\_gather issue fixed in 2.9 by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3756](https://redirect.github.com/huggingface/accelerate/pull/3756)
- Specify device\_ids in torch.distributed.barrier for PartialState by [@qgallouedec](https://redirect.github.com/qgallouedec) in [#3744](https://redirect.github.com/huggingface/accelerate/pull/3744)
- fix: specify device for process\_tensor in example usage by [@qgallouedec](https://redirect.github.com/qgallouedec) in [#3755](https://redirect.github.com/huggingface/accelerate/pull/3755)
- Lower complexity of get\_balanced\_memory by adding a set by [@SamuelBarryCS](https://redirect.github.com/SamuelBarryCS) in [#3776](https://redirect.github.com/huggingface/accelerate/pull/3776)
- Fix (skip) cuda cache flush when origin device is `cpu` and offloaded to `meta` by [@Qubitium](https://redirect.github.com/Qubitium) in [#3796](https://redirect.github.com/huggingface/accelerate/pull/3796)
- Fix convert LayerNorm without bias to fp8 by [@mjun0812](https://redirect.github.com/mjun0812) in [#3725](https://redirect.github.com/huggingface/accelerate/pull/3725)
- Add optional typing by [@cyyever](https://redirect.github.com/cyyever) in [#3769](https://redirect.github.com/huggingface/accelerate/pull/3769)
- refactor: Use `with` in Accelerator.autocast()instead of ` __enter__()` and` __exit__()` for more elegant style. by [@EquationWalker](https://redirect.github.com/EquationWalker) in [#3767](https://redirect.github.com/huggingface/accelerate/pull/3767)
- switch XPU ccl backend to torch-builtin xccl in test\_zero3\_integration by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3773](https://redirect.github.com/huggingface/accelerate/pull/3773)
- fix FSDP2 test case failure on XPU by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3771](https://redirect.github.com/huggingface/accelerate/pull/3771)
- Fix tests by [@SunMarc](https://redirect.github.com/SunMarc) in [#3722](https://redirect.github.com/huggingface/accelerate/pull/3722)
- Protect import for device\_mesh by [@SunMarc](https://redirect.github.com/SunMarc) in [#3742](https://redirect.github.com/huggingface/accelerate/pull/3742)
- Fix `SWANLAB_MODE` by [@SunMarc](https://redirect.github.com/SunMarc) in [#3808](https://redirect.github.com/huggingface/accelerate/pull/3808)
- Fix tracking swanlab by [@SunMarc](https://redirect.github.com/SunMarc) in [#3810](https://redirect.github.com/huggingface/accelerate/pull/3810)
- refactor: nit change for get\_parameters\_from\_modules (code debt) by [@kmehant](https://redirect.github.com/kmehant) in [#3815](https://redirect.github.com/huggingface/accelerate/pull/3815)
- Remove deprecated FindTiedParametersResult by [@cyyever](https://redirect.github.com/cyyever) in [#3786](https://redirect.github.com/huggingface/accelerate/pull/3786)
- Add optional typing by [@cyyever](https://redirect.github.com/cyyever) in [#3769](https://redirect.github.com/huggingface/accelerate/pull/3769)
- remove mlflow from testing by [@SunMarc](https://redirect.github.com/SunMarc) in [#3783](https://redirect.github.com/huggingface/accelerate/pull/3783)
- enable 2 model hook ut cases on XPU by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3774](https://redirect.github.com/huggingface/accelerate/pull/3774)
- Added Tip for better rendering by [@sergiopaniego](https://redirect.github.com/sergiopaniego) in [#3781](https://redirect.github.com/huggingface/accelerate/pull/3781)
- Fix typos by [@cyyever](https://redirect.github.com/cyyever) in [#3753](https://redirect.github.com/huggingface/accelerate/pull/3753)
- fix: torch\_npu import error in some envs by [@yanyongyu](https://redirect.github.com/yanyongyu) in [#3764](https://redirect.github.com/huggingface/accelerate/pull/3764)
- Fix: typo makes tests fail by [@S1ro1](https://redirect.github.com/S1ro1) in [#3765](https://redirect.github.com/huggingface/accelerate/pull/3765)
- fix Muti node CUDA error: invalid device ordinal [#3775](https://redirect.github.com/huggingface/accelerate/issues/3775) by [@RicardoDominguez](https://redirect.github.com/RicardoDominguez) in [#3779](https://redirect.github.com/huggingface/accelerate/pull/3779)
- use reset\_peak\_memory\_stats on xpu by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3772](https://redirect.github.com/huggingface/accelerate/pull/3772)

#### New Contributors

- [@mjun0812](https://redirect.github.com/mjun0812) made their first contribution in [#3725](https://redirect.github.com/huggingface/accelerate/pull/3725)
- [@sergiopaniego](https://redirect.github.com/sergiopaniego) made their first contribution in [#3761](https://redirect.github.com/huggingface/accelerate/pull/3761)
- 
[@EquationWalker](https://redirect.github.com/EquationWalker) made their first contribution in [#3762](https://redirect.github.com/huggingface/accelerate/pull/3762) - [@yanyongyu](https://redirect.github.com/yanyongyu) made their first contribution in [#3764](https://redirect.github.com/huggingface/accelerate/pull/3764) - [@RicardoDominguez](https://redirect.github.com/RicardoDominguez) made their first contribution in [#3779](https://redirect.github.com/huggingface/accelerate/pull/3779) - [@SamuelBarryCS](https://redirect.github.com/SamuelBarryCS) made their first contribution in [#3776](https://redirect.github.com/huggingface/accelerate/pull/3776) - [@Qubitium](https://redirect.github.com/Qubitium) made their first contribution in [#3796](https://redirect.github.com/huggingface/accelerate/pull/3796) **Full Changelog**: <https://github.com/huggingface/accelerate/compare/v1.10.1...v1.11.0> ### [`v1.10.1`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.10.1): : Patchfix [Compare Source](https://redirect.github.com/huggingface/accelerate/compare/v1.10.0...v1.10.1) - Feat: add to\_json by [@S1ro1](https://redirect.github.com/S1ro1) in [#3743](https://redirect.github.com/huggingface/accelerate/pull/3743) - Protect import for device\_mesh by [@SunMarc](https://redirect.github.com/SunMarc) in [#3742](https://redirect.github.com/huggingface/accelerate/pull/3742). **Full Changelog**: <https://github.com/huggingface/accelerate/compare/v1.10.0...v1.10.1> ### [`v1.10.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.10.0): : N-D Parallelism [Compare Source](https://redirect.github.com/huggingface/accelerate/compare/v1.9.0...v1.10.0) ### N-D Parallelism Training large models across multiple GPUs can be complex, especially when combining [different parallelism strategies](https://huggingface.co/spaces/nanotron/ultrascale-playbook) (e.g TP, CP, DP). 
To simplify this process, we've collaborated with [Axolotl](https://redirect.github.com/axolotl-ai-cloud/axolotl/) to introduce an easy-to-use integration that allows you to apply any combination of parallelism strategies directly in your training script. Just pass a `ParallelismConfig` specifying the size of each parallelism type; it's that simple. Learn more about how it works in our latest [blogpost](https://redirect.github.com/huggingface/blog/pull/3006).

```python
from accelerate import Accelerator
from accelerate.parallelism_config import ParallelismConfig
from transformers import AutoModelForCausalLM

# One size per parallelism dimension
parallelism_config = ParallelismConfig(
    dp_shard_size=2,
    dp_replicate_size=2,
    cp_size=2,
    tp_size=2,
)

accelerator = Accelerator(
    parallelism_config=parallelism_config,
    # ... other Accelerator arguments
)

# Load the model onto the device mesh built by the accelerator
model = AutoModelForCausalLM.from_pretrained("your-model-name", device_mesh=accelerator.torch_device_mesh)
model = accelerator.prepare(model)
```

- Parallelism config + TP + HSDP + BYODM (Bring Your Own Device Mesh) by [@SalmanMohammadi](https://redirect.github.com/SalmanMohammadi) in [#3682](https://redirect.github.com/huggingface/accelerate/pull/3682)
- Feat: context parallel v2.0 by [@S1ro1](https://redirect.github.com/S1ro1) in [#3700](https://redirect.github.com/huggingface/accelerate/pull/3700)
- set default submesh\_tp\_size to prevent unset local variable error by [@winglian](https://redirect.github.com/winglian) in [#3687](https://redirect.github.com/huggingface/accelerate/pull/3687)
- Add Parallelism getter property to Accelerator class by [@WoosungMyung](https://redirect.github.com/WoosungMyung) in [#3703](https://redirect.github.com/huggingface/accelerate/pull/3703)
- Fix: prepare works even if nothing except tp specified (rare) by [@S1ro1](https://redirect.github.com/S1ro1) in [#3707](https://redirect.github.com/huggingface/accelerate/pull/3707)
- Set parallelism\_config in constructor due to Trainer reset of State by [@winglian](https://redirect.github.com/winglian) in [#3713](https://redirect.github.com/huggingface/accelerate/pull/3713)
- Fix: tp size wouldn't read from env by
[@S1ro1](https://redirect.github.com/S1ro1) in [#3716](https://redirect.github.com/huggingface/accelerate/pull/3716)
- Remove `ParallelismConfig` from `PartialState` by [@SunMarc](https://redirect.github.com/SunMarc) in [#3720](https://redirect.github.com/huggingface/accelerate/pull/3720)

### FSDP improvements

We've fixed the ignored modules attribute. With this, it is now possible to train a PEFT model with MoE layers that contain `q_proj` and `v_proj` parameters. This is especially important for fine-tuning the `gpt-oss` model.

- ENH: Allow FSDP ignored modules to be regex by [@BenjaminBossan](https://redirect.github.com/BenjaminBossan) in [#3698](https://redirect.github.com/huggingface/accelerate/pull/3698)
- TST Add test for FSDP ignored\_modules as str by [@BenjaminBossan](https://redirect.github.com/BenjaminBossan) in [#3719](https://redirect.github.com/huggingface/accelerate/pull/3719)

### Minor improvements

- feature: CpuOffload pre\_forward don't attempt to move if already on device by [@JoeGaffney](https://redirect.github.com/JoeGaffney) in [#3695](https://redirect.github.com/huggingface/accelerate/pull/3695)
- Fix: Ensure environment variable values are case-insensitive in Accelerate by [@jp1924](https://redirect.github.com/jp1924) in [#3712](https://redirect.github.com/huggingface/accelerate/pull/3712)
- remove use\_ipex by [@SunMarc](https://redirect.github.com/SunMarc) in [#3721](https://redirect.github.com/huggingface/accelerate/pull/3721)

### New Contributors

- [@SalmanMohammadi](https://redirect.github.com/SalmanMohammadi) made their first contribution in [#3682](https://redirect.github.com/huggingface/accelerate/pull/3682)
- [@WoosungMyung](https://redirect.github.com/WoosungMyung) made their first contribution in [#3703](https://redirect.github.com/huggingface/accelerate/pull/3703)
- [@jp1924](https://redirect.github.com/jp1924) made their first contribution in [#3712](https://redirect.github.com/huggingface/accelerate/pull/3712)
-
[@JoeGaffney](https://redirect.github.com/JoeGaffney) made their first contribution in [#3695](https://redirect.github.com/huggingface/accelerate/pull/3695)

**Full Changelog**: <https://github.com/huggingface/accelerate/compare/v1.9.0...v1.10.0>

### [`v1.9.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.9.0): Trackio support, Model loading speedup, Minor distributed improvements

[Compare Source](https://redirect.github.com/huggingface/accelerate/compare/v1.8.1...v1.9.0)

### Trackio tracker support

We've added support for trackio, a lightweight, 💯 free experiment tracking Python library built on top of 🤗 Datasets and Spaces.

![Screen Recording 2025-06-11 at 5 39 32 PM](https://redirect.github.com/user-attachments/assets/5cf12286-54e7-4119-8a20-88c2cbd37ab6)

Main features are:

- *Local-first* design: dashboard runs locally by default. You can also host it on Spaces by specifying a `space_id`.
- Persists logs locally (or in a private Hugging Face Dataset)
- Visualize experiments with a Gradio dashboard locally (or on Hugging Face Spaces)
- Everything here, including hosting on Hugging Face, is **free**!

To use it with accelerate, you need to set `log_with` and initialize the trackers:

```python
from accelerate import Accelerator

accelerator = Accelerator(log_with="trackio")
config = {"learning_rate": 0.001, "batch_size": 32}
# init_kwargs in order to host the dashboard on spaces
init_kwargs = {"trackio": {"space_id": "hf_username/space_name"}}
accelerator.init_trackers("example_project", config=config, init_kwargs=init_kwargs)
```

Thanks [@pcuenca](https://redirect.github.com/pcuenca) for the integration!

- trackio by [@pcuenca](https://redirect.github.com/pcuenca) in [#3669](https://redirect.github.com/huggingface/accelerate/pull/3669)

#### Model loading speedup when relying on `set_module_tensor_to_device`

Setting a tensor while clearing the cache is very slow, so we added a `clear_device` option to disable it.
Another small optimization is using `non_blocking` everywhere and syncing just before returning control to the user. This makes the loading slightly faster.

- Speedup model loading by 4-5x in Diffusers ⚡ by [@a-r-r-o-w](https://redirect.github.com/a-r-r-o-w) in [#3674](https://redirect.github.com/huggingface/accelerate/pull/3674)

#### FSDP, DeepSpeed, FP8 minor improvements

- Add support for e5e2 and default to hybrid when launcher is used by [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) in [#3640](https://redirect.github.com/huggingface/accelerate/pull/3640)
- Fix FP8 tests, enable FP8 to be used without direct `Accelerator()` configuring by [@pstjohn](https://redirect.github.com/pstjohn) in [#3677](https://redirect.github.com/huggingface/accelerate/pull/3677)
- Bunch of FSDP improvements by [@S1ro1](https://redirect.github.com/S1ro1) in [#3671](https://redirect.github.com/huggingface/accelerate/pull/3671)
- Fix: properly error when DDP + Dtensor model by [@S1ro1](https://redirect.github.com/S1ro1) in [#3629](https://redirect.github.com/huggingface/accelerate/pull/3629)
- Fix fsdp2 example typo by [@shimizust](https://redirect.github.com/shimizust) in [#3657](https://redirect.github.com/huggingface/accelerate/pull/3657)
- Added a check in no\_sync() to avoid errors when using deepspeed zero2/3 by [@xliu0105](https://redirect.github.com/xliu0105) in [#3656](https://redirect.github.com/huggingface/accelerate/pull/3656)

#### 🚨🚨🚨 Breaking changes 🚨🚨🚨

`find_executable_batch_size()` no longer halves the batch size after every OOM. Instead, it multiplies the batch size by 0.9. This should help users avoid wasting GPU capacity.
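To make the effect of the new back-off factor concrete, here is a minimal, self-contained sketch of the arithmetic. It does not call Accelerate; `first_fitting_batch_size` and `max_fit` are hypothetical names used only for illustration, and truncating to an integer after each multiplication is an assumption about the rounding.

```python
def first_fitting_batch_size(start: int, factor: float, max_fit: int) -> int:
    """Shrink `start` by `factor` until it is at most `max_fit`.

    `max_fit` stands in for the largest batch size that does not OOM;
    in a real run this is unknown and discovered by catching OOM errors.
    Truncation with int() is an assumption made for this sketch.
    """
    batch_size = start
    while batch_size > max_fit:
        batch_size = int(batch_size * factor)
        if batch_size == 0:
            raise RuntimeError("No batch size fits in memory.")
    return batch_size


# Hypothetical scenario: we start at 128, but only 100 samples fit.
old_result = first_fitting_batch_size(128, factor=0.5, max_fit=100)
new_result = first_fitting_batch_size(128, factor=0.9, max_fit=100)
print(f"back-off 0.5 settles on {old_result}, back-off 0.9 settles on {new_result}")
```

In this made-up scenario the gentler factor needs more retries but ends much closer to the real memory limit, which is the trade-off the breaking change makes.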
- “Stop Halving My Batch!” · Default back-off 0.5 → 0.9 by [@SunMarc](https://redirect.github.com/SunMarc) in [#3684](https://redirect.github.com/huggingface/accelerate/pull/3684) #### What's Changed - \[typo] shards instead of shard by [@SunMarc](https://redirect.github.com/SunMarc) in [#3645](https://redirect.github.com/huggingface/accelerate/pull/3645) - Docs: Fix typos in gradient accumulation guide by [@kilavvy](https://redirect.github.com/kilavvy) in [#3649](https://redirect.github.com/huggingface/accelerate/pull/3649) - xpu enablement on left cases by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3654](https://redirect.github.com/huggingface/accelerate/pull/3654) - unpin datasets in examples requirements by [@SunMarc](https://redirect.github.com/SunMarc) in [#3681](https://redirect.github.com/huggingface/accelerate/pull/3681) - fix: wandb config not saved in offline mode by [@ved1beta](https://redirect.github.com/ved1beta) in [#3648](https://redirect.github.com/huggingface/accelerate/pull/3648) - accelerate/data\_loader.py: do not yield if the base\_dataloader is empty by [@0xnightwind](https://redirect.github.com/0xnightwind) in [#3659](https://redirect.github.com/huggingface/accelerate/pull/3659) - warn for invalid keys by [@ved1beta](https://redirect.github.com/ved1beta) in [#3613](https://redirect.github.com/huggingface/accelerate/pull/3613) - Update Gaudi runner image to latest SynapseAI and enable previously disabled tests by [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) in [#3653](https://redirect.github.com/huggingface/accelerate/pull/3653) #### New Contributors - [@kilavvy](https://redirect.github.com/kilavvy) made their first contribution in [#3649](https://redirect.github.com/huggingface/accelerate/pull/3649) - [@shimizust](https://redirect.github.com/shimizust) made their first contribution in [#3657](https://redirect.github.com/huggingface/accelerate/pull/3657) - [@xliu0105](https://redirect.github.com/xliu0105) 
made their first contribution in [#3656](https://redirect.github.com/huggingface/accelerate/pull/3656) - [@0xnightwind](https://redirect.github.com/0xnightwind) made their first contribution in [#3659](https://redirect.github.com/huggingface/accelerate/pull/3659) **Full Changelog**: <https://github.com/huggingface/accelerate/compare/v1.8.1...v1.9.0> ### [`v1.8.1`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.8.1): : Patchfix [Compare Source](https://redirect.github.com/huggingface/accelerate/compare/v1.8.0...v1.8.1) - Add support for e5e2 and default to hybrid when launcher is used by [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) in [#3640](https://redirect.github.com/huggingface/accelerate/pull/3640) - shards by [@SunMarc](https://redirect.github.com/SunMarc) in [#3645](https://redirect.github.com/huggingface/accelerate/pull/3645) **Full Changelog**: <https://github.com/huggingface/accelerate/compare/v1.8.0...v1.8.1> ### [`v1.8.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.8.0): : FSDPv2 + FP8, Regional Compilation for DeepSpeed, Faster Distributed Training on Intel CPUs, ipex.optimize deprecation [Compare Source](https://redirect.github.com/huggingface/accelerate/compare/v1.7.0...v1.8.0) ### FSDPv2 refactor + FP8 support We've simplified how to prepare FSDPv2 models, as there were too many ways to compose FSDP2 with other features (e.g., FP8, torch.compile, activation checkpointing, etc.). Although the setup is now more restrictive, it leads to fewer errors and a more performant user experience. We’ve also added support for FP8. You can read about the results [here](https://redirect.github.com/huggingface/accelerate/tree/main/examples/fsdp2). Thanks to [@S1ro1](https://redirect.github.com/S1ro1) for this contribution! 
- \[FSDP2] Refactor + FP8 by [@S1ro1](https://redirect.github.com/S1ro1) in [#3585](https://redirect.github.com/huggingface/accelerate/pull/3585) ### Faster Distributed Training on Intel CPUs We updated the `CCL_WORKER_COUNT` variable and added `KMP` parameters for Intel CPU users. This significantly improves distributed training performance (e.g., Tensor Parallelism), with up to a 40% speed-up on Intel 4th Gen Xeon when training transformer TP models. - Set ccl and KMP param in simple launch by [@jiqing-feng](https://redirect.github.com/jiqing-feng) in [#3575](https://redirect.github.com/huggingface/accelerate/pull/3575) ### Regional Compilation for DeepSpeed We added support for regional compilation with the DeepSpeed engine. DeepSpeed’s .compile() modifies models in-place using torch.nn.Module.compile(...), rather than the out-of-place torch.compile(...), so we had to account for that. Thanks [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) for this feature! - Fix deepspeed regional compilation by [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) in [#3609](https://redirect.github.com/huggingface/accelerate/pull/3609) ### ipex.optimize deprecation `ipex.optimize` is being deprecated. Most optimizations have been upstreamed to PyTorch, and future improvements will land there directly. For users without PyTorch 2.8, we’ll continue to rely on IPEX for now. 
- remove ipex.optimize in accelerate by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3608](https://redirect.github.com/huggingface/accelerate/pull/3608) ### Better XPU Support We've greatly expanded and stabilized support for Intel XPUs: - enable fsdp2 benchmark on XPU by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3590](https://redirect.github.com/huggingface/accelerate/pull/3590) - enable big\_model\_inference on xpu by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3595](https://redirect.github.com/huggingface/accelerate/pull/3595) - enable test\_load\_checkpoint\_and\_dispatch\_with\_broadcast cases on XPU by [@yao-matrix](https://redirect.github.com/yao-matrix) in - enable test\_cli & test\_example cases on XPU by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3578](https://redirect.github.com/huggingface/accelerate/pull/3578) - enable torchao and pippy test cases on XPU by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3599](https://redirect.github.com/huggingface/accelerate/pull/3599) - enable regional\_compilation benchmark on xpu by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3592](https://redirect.github.com/huggingface/accelerate/pull/3592) - fix xpu 8bit value loading by [@jiqing-feng](https://redirect.github.com/jiqing-feng) in [#3623](https://redirect.github.com/huggingface/accelerate/pull/3623) - add device-agnostic GradScaler by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3588](https://redirect.github.com/huggingface/accelerate/pull/3588) - add xpu support in TorchTensorParallelPlugin by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3627](https://redirect.github.com/huggingface/accelerate/pull/3627) ### Trackers We've added support for [SwanLab](https://redirect.github.com/SwanHubX/SwanLab) as an experiment tracking backend. Huge thanks to [@ShaohonChen](https://redirect.github.com/ShaohonChen) for this contribution ! 
We also deferred all tracker initializations to prevent premature setup of distributed environments. - Integrate SwanLab for offline/online experiment tracking for Accelerate by [@ShaohonChen](https://redirect.github.com/ShaohonChen) in [#3605](https://redirect.github.com/huggingface/accelerate/pull/3605) - Fix: Defer Tracker Initialization to Prevent Premature Distributed Setup by [@yuanjua](https://redirect.github.com/yuanjua) in [#3581](https://redirect.github.com/huggingface/accelerate/pull/3581) #### What's Changed - Fix bf16 training with TP by [@SunMarc](https://redirect.github.com/SunMarc) in [#3610](https://redirect.github.com/huggingface/accelerate/pull/3610) - better handle FP8 with and without deepspeed by [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) in [#3611](https://redirect.github.com/huggingface/accelerate/pull/3611) - Update Gaudi Runners by [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) in [#3593](https://redirect.github.com/huggingface/accelerate/pull/3593) - goodbye torch\_ccl by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3580](https://redirect.github.com/huggingface/accelerate/pull/3580) - Add support for standalone mode when default port is occupied on single node by [@laitifranz](https://redirect.github.com/laitifranz) in [#3576](https://redirect.github.com/huggingface/accelerate/pull/3576) - Resolve logger warnings by [@emmanuel-ferdman](https://redirect.github.com/emmanuel-ferdman) in [#3582](https://redirect.github.com/huggingface/accelerate/pull/3582) - Add kwargs to optimizer, scheduler and dataloader using function `accelerator().load_state()` by [@luiz0992](https://redirect.github.com/luiz0992) in [#3540](https://redirect.github.com/huggingface/accelerate/pull/3540) - \[docs] no hard-coded cuda in the ddp documentation by [@faaany](https://redirect.github.com/faaany) in [#3589](https://redirect.github.com/huggingface/accelerate/pull/3589) - change to use torch.device by 
[@yao-matrix](https://redirect.github.com/yao-matrix) in [#3594](https://redirect.github.com/huggingface/accelerate/pull/3594) - Fix: list object has no attribute keys by [@S1ro1](https://redirect.github.com/S1ro1) in [#3603](https://redirect.github.com/huggingface/accelerate/pull/3603) - Update Gaudi Runners by [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) in [#3593](https://redirect.github.com/huggingface/accelerate/pull/3593) - Fix bf16 training with TP by [@SunMarc](https://redirect.github.com/SunMarc) in [#3610](https://redirect.github.com/huggingface/accelerate/pull/3610) - better handle FP8 with and without deepspeed by [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) in [#3611](https://redirect.github.com/huggingface/accelerate/pull/3611) - Remove device\_count for TPU launcher to avoid initializing runtime by [@sorgfresser](https://redirect.github.com/sorgfresser) in [#3587](https://redirect.github.com/huggingface/accelerate/pull/3587) - Fix missing te.LayerNorm in intel\_transformer\_engine by [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) in [#3619](https://redirect.github.com/huggingface/accelerate/pull/3619) - Add fp8\_e5m2 support in `dtype_byte_size` by [@SunMarc](https://redirect.github.com/SunMarc) in [#3625](https://redirect.github.com/huggingface/accelerate/pull/3625) - \[Deepspeed] deepspeed auto grad accum by [@kashif](https://redirect.github.com/kashif) in [#3630](https://redirect.github.com/huggingface/accelerate/pull/3630) - Remove hardcoded cuda from fsdpv2 by [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) in [#3631](https://redirect.github.com/huggingface/accelerate/pull/3631) - Integrate SwanLab for offline/online experiment tracking for Accelerate by [@ShaohonChen](https://redirect.github.com/ShaohonChen) in [#3605](https://redirect.github.com/huggingface/accelerate/pull/3605) - Fix Typos in Documentation and Comments by 
[@leopardracer](https://redirect.github.com/leopardracer) in [#3621](https://redirect.github.com/huggingface/accelerate/pull/3621) - feat: use datasets.IterableDataset shard if possible by [@SunMarc](https://redirect.github.com/SunMarc) in [#3635](https://redirect.github.com/huggingface/accelerate/pull/3635) - \[DeepSpeed] sync gradient accum steps from deepspeed plugin by [@kashif](https://redirect.github.com/kashif) in [#3632](https://redirect.github.com/huggingface/accelerate/pull/3632) - Feat: add cpu offload by [@S1ro1](https://redirect.github.com/S1ro1) in [#3636](https://redirect.github.com/huggingface/accelerate/pull/3636) - Fix: correct labels for fsdp2 examples by [@S1ro1](https://redirect.github.com/S1ro1) in [#3637](https://redirect.github.com/huggingface/accelerate/pull/3637) - fix grad acc deepspeed by [@SunMarc](https://redirect.github.com/SunMarc) in [#3638](https://redirect.github.com/huggingface/accelerate/pull/3638) #### New Contributors - [@laitifranz](https://redirect.github.com/laitifranz) made their first contribution in [#3576](https://redirect.github.com/huggingface/accelerate/pull/3576) - [@emmanuel-ferdman](https://redirect.github.com/emmanuel-ferdman) made their first contribution in [#3582](https://redirect.github.com/huggingface/accelerate/pull/3582) - [@yuanjua](https://redirect.github.com/yuanjua) made their first contribution in [#3581](https://redirect.github.com/huggingface/accelerate/pull/3581) - [@sorgfresser](https://redirect.github.com/sorgfresser) made their first contribution in [#3587](https://redirect.github.com/huggingface/accelerate/pull/3587) - [@ShaohonChen](https://redirect.github.com/ShaohonChen) made their first contribution in [#3605](https://redirect.github.com/huggingface/accelerate/pull/3605) - [@leopardracer](https://redirect.github.com/leopardracer) made their first contribution in [#3621](https://redirect.github.com/huggingface/accelerate/pull/3621) **Full Changelog**: 
<https://github.com/huggingface/accelerate/compare/v1.7.0...v1.8.0>

### [`v1.7.0`](https://redirect.github.com/huggingface/accelerate/releases/tag/v1.7.0): Regional compilation, Layerwise casting hook, FSDPv2 + QLoRA

[Compare Source](https://redirect.github.com/huggingface/accelerate/compare/v1.6.0...v1.7.0)

### Regional compilation

Instead of compiling the entire model at once, regional compilation targets repeated blocks (such as decoder layers) first. This allows the compiler to cache and reuse optimized code for subsequent blocks, significantly reducing the cold start compilation time typically seen during the first inference. Thanks [@IlyasMoutawwakil](https://redirect.github.com/IlyasMoutawwakil) for the feature!

You can view the full benchmark [here](https://redirect.github.com/huggingface/accelerate/tree/main/benchmarks/torch.compile), and check out our updated [compilation guide](https://huggingface.co/docs/accelerate/en/usage_guides/compilation) for more details!

![compilation\_time-1](https://redirect.github.com/user-attachments/assets/38795d12-6ee7-4a10-84c6-d29a0877e36c)

To enable this feature, set `use_regional_compilation=True` in the `TorchDynamoPlugin` configuration.

```python
from accelerate import Accelerator
from accelerate.utils import TorchDynamoPlugin

# Configure the compilation backend
dynamo_plugin = TorchDynamoPlugin(
    use_regional_compilation=True,
    # ... other parameters
)

# Initialize accelerator with the plugin
accelerator = Accelerator(dynamo_plugin=dynamo_plugin)

# This will apply compile_regions to your model
model = accelerator.prepare(model)
```

### Layerwise casting hook

We've introduced a new hook that enables per-layer upcasting and downcasting (e.g., for Linear layers) during inference. This allows users to run models with separate storage and compute dtypes, resulting in memory savings. The concept was first implemented in [diffusers](https://huggingface.co/docs/diffusers/main/en/optimization/memory#layerwise-casting), where downcasting models to FP8 proved effective without major quality degradation.
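As a rough illustration of where the memory savings come from, the sketch below compares weight storage for a bf16 and an FP8 storage dtype. It is plain arithmetic, not Accelerate code: the 7B parameter count is hypothetical, `weight_memory_gib` is a made-up helper, and activations, buffers and any layers skipped by the hook are ignored.

```python
# Bytes needed to store one element of each dtype.
BYTES_PER_ELEMENT = {"float32": 4, "bfloat16": 2, "float8_e4m3fn": 1}


def weight_memory_gib(num_params: int, dtype: str) -> float:
    """Approximate weight storage in GiB for `num_params` elements of `dtype`."""
    return num_params * BYTES_PER_ELEMENT[dtype] / 2**30


num_params = 7_000_000_000  # hypothetical model size, for illustration only

stored_bf16 = weight_memory_gib(num_params, "bfloat16")
stored_fp8 = weight_memory_gib(num_params, "float8_e4m3fn")

print(f"bf16 storage: {stored_bf16:.1f} GiB")
print(f"fp8 storage:  {stored_fp8:.1f} GiB")
```

Storing weights in an 8-bit dtype and upcasting each layer to bf16 only while it computes therefore roughly halves the weight footprint compared with keeping everything in bf16.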
Contributed by [@sayakpaul](https://redirect.github.com/sayakpaul) in [#3427](https://redirect.github.com/huggingface/accelerate/pull/3427)

```python
import torch
from accelerate.big_modeling import attach_layerwise_casting_hooks

model = ...  # your torch.nn.Module
storage_dtype = torch.float8_e4m3fn  # dtype the weights are kept in
compute_dtype = torch.bfloat16  # dtype used during the forward pass
attach_layerwise_casting_hooks(
    model,
    storage_dtype=storage_dtype,
    compute_dtype=compute_dtype,
)
```

### Better FSDP2 support

This release includes numerous new features and bug fixes. Notably, we’ve added support for `FULL_STATE_DICT`, a widely used option in FSDP, now enabling `.save_pretrained()` in transformers to work with FSDP2 wrapped models. QLoRA training is now supported as well, but more testing is needed. We have also resolved a backend issue related to parameter offloading to CPU. Additionally, a significant memory spike that occurred when `cpu_ram_efficient_loading=True` was enabled has been fixed. Several other minor improvements and fixes are also included; see the **What’s Changed** section for full details.

- `FULL_STATE_DICT` has been enabled by [@S1ro1](https://redirect.github.com/S1ro1) in [#3527](https://redirect.github.com/huggingface/accelerate/pull/3527)
- QLoRA support by [@winglian](https://redirect.github.com/winglian) in [#3546](https://redirect.github.com/huggingface/accelerate/pull/3546)
- set backend correctly for CUDA+FSDP2+cpu-offload in [#3574](https://redirect.github.com/huggingface/accelerate/pull/3574)
- memory spike fixed when using `cpu_ram_efficient_loading=True` by [@S1ro1](https://redirect.github.com/S1ro1) in [#3482](https://redirect.github.com/huggingface/accelerate/pull/3482)

### Better HPU support:

We have added [documentation](https://huggingface.co/docs/accelerate/en/usage_guides/gaudi) for Intel Gaudi hardware! Support has been available since v1.5.0 through this [PR](https://redirect.github.com/huggingface/accelerate/pull/3378).
- Add the HPU into accelerate config by [@yuanwu2017](https://redirect.github.com/yuanwu2017) in [#3495](https://redirect.github.com/huggingface/accelerate/pull/3495) - Add Gaudi doc by [@regisss](https://redirect.github.com/regisss) in [#3537](https://redirect.github.com/huggingface/accelerate/pull/3537) ### Torch.compile breaking change for `dynamic` argument We've updated the logic for setting `self.dynamic` to explicitly preserve None rather than defaulting to `False` when the `USE_DYNAMIC` environment variable is unset. This change aligns the behavior with the PyTorch documentation for [torch.compile](https://docs.pytorch.org/stable/generated/torch.compile.html). Thanks to [@yafshar](https://redirect.github.com/yafshar) for contributing this improvement in [#3567](https://redirect.github.com/huggingface/accelerate/pull/3567). #### What's Changed - use device agnostic torch.OutOfMemoryError from pytorch 2.5.0 by [@yao-matrix](https://redirect.github.com/yao-matrix) in [#3475](https://redirect.github.com/huggingface/accelerate/pull/3475) - Adds style bot by [@zach-huggingface](https://redirect.github.com/zach-huggingface) in [#3478](https://redirect.github.com/huggingface/accelerate/pull/3478) - Fix a tiny typo in `low_precision_training` guide by [@sadra-barikbin](https://redirect.github.com/sadra-barikbin) in [#3488](https://redirect.github.com/huggingface/accelerate/pull/3488) - Fix check\_tied\_parameters\_in\_config for multimodal models by [@SunMarc](https://redirect.github.com/SunMarc) in [#3479](https://redirect.github.com/huggingface/accelerate/pull/3479) - Don't create new param for TorchAO sequential offloading due to weak BC guarantees by [@a-r-r-o-w](https://redirect.github.com/a-r-r-o-w) in [#3444](https://redirect.github.com/huggingface/accelerate/pull/3444) - add support for custom function for reducing the batch size by [@winglian](https://redirect.github.com/winglian) in [#3071](https://redirect.github.com/huggingface/accelerate/pull/3071) - 
Fix fp8 deepspeed config by [@SunMarc](https://redirect.github.com/SunMarc) in [#3492](https://redirect.github.com/huggingface/accelerate/pull/3492) - fix warning error by [@faaany](https://redirect.github.com/faaany) in [#3491](https://redirect.github.com/huggingface/accelerate/pull/3491) - \[bug] unsafe\_serialization option in "merge-weights" doesn't work by [@cyr0930](https://redirect.github.com/cyr0930) in [#3496](https://redirect.github.com/huggingface/accelerate/pull/3496) - Add the HPU into accelerate config by [@yuanwu2017](https://redirect.github.com/yuanwu2017) in [#3495](https://redirect.github.com/huggingface/accelerate/pull/3495) - Use `torch.distributed.checkpoint.state_dict.set_model_state_dict` in `load_checkpoint_in_model` by [@ringohoffman](https://redirect.github.com/ringohoffman) in [#3432](https://redirect.github.com/huggingface/accelerate/pull/3432) - nit: needed sanity checks for fsdp2 by [@kmehant](https://redirect.github.com/kmehant) in [#3499](https://redirect.github.com/huggingface/accelerate/pull/3499) - (Part 1) fix: make TP training compatible with new transformers by [@kmehant](https://redirect.github.com/kmehant) in [#3457](https://redirect.github.com/huggingface/accelerate/pull/3457) - Fix deepspeed tests by [@S1ro1](https://redirect.github.com/S1ro1) in [#3503](https://redirect.github.com/huggingface/accelerate/pull/3503) - Add FP8 runners + tweak building FP8 image by [@zach-huggingface](https://redirect.github.com/zach-huggingface) in [#3493](https://redirect.github.com/huggingface/accelerate/pull/3493) - fix: apply torchfix to set `weights_only=True` by [@bzhong-solink](https://redirect.github.com/bzhong-solink) in [#3497](https://redirect.github.com/huggingface/accelerate/pull/3497) - Fix: require transformers version for tp tests by [@S1ro1](https://redirect.github.com/S1ro1) in [#3504](https://redirect.github.com/huggingface/accelerate/pull/3504) - Remove deprecated PyTorch/XLA APIs by 
[@zpcore](https://redirect.github.com/zpcore) in [#3484](https://redirect.github.com/huggingface/accelerate/pull/3484) - Fix cache issue by upgrading github actions version by [@SunMarc](https://redirect.github.com/SunMarc) in [#3513](https://redirect.github.com/huggingface/accelerate/pull/3513) - \[Feat] Layerwise casting hook by [@sayakpaul](https://redirect.github.com/sayakpaul) in [#3427](https://redirect.github.com/huggingface/accelerate/pull/3427) - Add torchao to FP8 error message by [@jphme](https://redirect.github.com/jphme) in [#3514](https://redirect.github.com/huggingface/accelerate/pull/3514) - Fix unwanted cuda init due to torchao by [@SunMarc](https://redirect.github.com/SunMarc) in [#3530](https://redirect.github.com/huggingface/accelerate/pull/3530) - Solve link error in internal\_mechanism documentation ([#3506](https://redirect.github.com/huggingface/accelerate/issues/3506)) by [@alvaro-mazcu](https://redirect.github.com/alvaro-mazcu) in [#3507](https://redirect.github.com/huggingface/accelerate/pull/3507) - \[FSDP2] Enable FULL\_STATE\_DICT by [@S1ro1](https://redirect.github.com/S1ro1) in [#3527](https://redirect.github.com/huggingface/accelerate/pull/3527) - \[FSDP2] Fix memory spike with `cpu_ram_efficient_loading=True` by [@S1ro1](https://redirect.github.com/S1ro1) in [#3482](https://redirect.github.com/huggingface/accelerate/pull/3482) - \[FSDP2] Issues in Wrap Policy and Mixed Precision by [@jhliu17](https://redirect.github.com/jhliu17) in [#3528](https://redire > ✂ **Note** > > PR body was truncated to here. </details> --- ### Configuration 📅 **Schedule**: (UTC) - Branch creation - At any time (no schedule defined) - Automerge - At any time (no schedule defined) 🚦 **Automerge**: Enabled. ♻ **Rebasing**: Whenever PR is behind base branch, or you tick the rebase/retry checkbox. 🔕 **Ignore**: Close this PR and you won't be reminded about this update again. 
--- - [ ] If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://redirect.github.com/renovatebot/renovate).  Co-authored-by: aar-public-version-bump-bot[bot] <286693160+aar-public-version-bump-bot[bot]@users.noreply.github.com>
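
The `dynamic` change in the release notes above comes down to a tri-state lookup: an unset variable should yield `None` (let `torch.compile` decide) instead of collapsing to `False`. The following is only a minimal sketch of that idea, not Accelerate's actual implementation. The helper name `parse_dynamic` and the set of accepted truthy strings are assumptions made for illustration; the variable name `USE_DYNAMIC` is taken from the notes as written.

```python
import os
from typing import Mapping, Optional


def parse_dynamic(
    env: Mapping[str, str] = os.environ,
    key: str = "USE_DYNAMIC",  # name as written in the release notes
) -> Optional[bool]:
    """Hypothetical helper: return None when the variable is unset,
    otherwise interpret its value as a boolean."""
    raw = env.get(key)
    if raw is None:
        # Unset: preserve None so the caller can defer to torch.compile's default.
        return None
    # Assumed truthy spellings; anything else is treated as False.
    return raw.strip().lower() in {"1", "true", "yes", "y", "on"}


if __name__ == "__main__":
    print(parse_dynamic({}))                       # variable unset
    print(parse_dynamic({"USE_DYNAMIC": "true"}))  # explicitly enabled
    print(parse_dynamic({"USE_DYNAMIC": "0"}))     # explicitly disabled
```

Passing the environment as a mapping keeps the sketch easy to exercise without mutating `os.environ`.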

HF Trainer: ALST/Ulysses sequence parallelism integration via HF Accelerate

cfee9c9

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

This was referenced Oct 23, 2025

Deepspeed Ulysses/ALST integration huggingface/accelerate#3817

Merged

Add support for context parallelism #35983

Closed

stas00 mentioned this pull request Oct 26, 2025

[tests] Add Context-parallel CI tests #41860

Merged

5 tasks

stas00 marked this pull request as draft October 27, 2025 04:42

make it work + tests

6e28ca8

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

stas00 commented Oct 28, 2025

Comment thread src/transformers/testing_utils.py

stas00 commented Oct 28, 2025

cleanup

86a09b9

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

stas00 marked this pull request as ready for review October 28, 2025 02:34

Merge branch 'main' into alst-integration

bb902f9

undo

c0e8e0d

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

kashif mentioned this pull request Nov 1, 2025

[ALST/Ulysses] Added ALST/Ulysses documentation huggingface/trl#4420

Merged

9 tasks

SunMarc reviewed Nov 4, 2025

sfc-gh-sbekman and others added 6 commits November 5, 2025 03:38

normalize

101eaff

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

always return cp_size

d8770d5

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

cleanup

4f416a4

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

extract code into _deepspeed_cp_compute_loss

ce5e392

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

fix

3ceaa94

Signed-off-by: Stas Bekman <stas.bekman@snowflake.com>

Merge branch 'main' into alst-integration

607e166

kashif reviewed Nov 19, 2025

Comment thread src/transformers/trainer.py Outdated

Apply suggestions from code review

a05eb52

kashif approved these changes Nov 19, 2025

Merge branch 'main' into alst-integration

ad61079

kashif reviewed Nov 20, 2025

Comment thread docs/source/en/deepspeed.md

Apply suggestion from @kashif

3fd097d

kashif reviewed Nov 20, 2025

Comment thread docs/source/en/deepspeed.md

Apply suggestion from @kashif

e3d8eda

kashif reviewed Nov 20, 2025

Comment thread docs/source/en/deepspeed.md Outdated

Apply suggestion from @kashif

ef59f3e

SunMarc reviewed Nov 20, 2025

Comment thread src/transformers/training_args.py Outdated

SunMarc reviewed Nov 20, 2025

Comment thread src/transformers/trainer.py Outdated

kashif and others added 3 commits November 20, 2025 17:28

Update src/transformers/trainer.py

59487a8

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Update src/transformers/training_args.py

2444728

Co-authored-by: Marc Sun <57196510+SunMarc@users.noreply.github.com>

Merge branch 'main' into alst-integration

4f33c2f

kashif reviewed Nov 21, 2025

Comment thread src/transformers/trainer.py Outdated

Apply suggestion from @kashif

7d09b28

kashif reviewed Nov 21, 2025

Comment thread docs/source/en/deepspeed.md Outdated

Apply suggestion from @kashif

2e52913

ArthurZucker approved these changes Nov 21, 2025

SunMarc approved these changes Nov 21, 2025

Merge branch 'main' into alst-integration

7a5c45e

ArthurZucker merged commit 7e0ea69 into huggingface:main Nov 21, 2025
15 of 21 checks passed

stas00 deleted the alst-integration branch November 21, 2025 17:39
