Releases: vipshop/cache-dit
v1.1.7
v1.1.6
v1.1.5 🔥HunyuanVideo-1.5/Ovis-Image
What's Changed
- Add profiler for flux tp and cp example by @BBuf in #501
- chore: Update README.md by @DefTruth in #502
- feat: support FnB0 for z-image w/ cp by @DefTruth in #503
- feat: support _sdpa_cudnn backend for cp by @DefTruth in #504
- feat: support async ulysses cp for z-image by @DefTruth in #505
- feat: add all_to_all_single v2 by @DefTruth in #507
- feat: support async ulysses cp for qwen-image by @DefTruth in #508
- feat: support all2all qkv per token fp8 by @triple-Mu in #509
- chore: improve flux2 and qwen image examples by @BBuf in #512
- fix: workaround for uaa-fp8 .view compile error by @triple-Mu in #514
- feat: relaxed transformer strict assert by @DefTruth in #515
- feat: all2all qkv fp8 for ulysses by @DefTruth in #516
- feat: support pre-defined step masks by @DefTruth in #517
- chore: separate chrono-edit and wan cp plan by @DefTruth in #519
- fix example utils.py uaa fp8 flag typo by @DefTruth in #521
- feat: extend predefined step masks for 4/6 steps by @DefTruth in #523
- misc: add z-image-turbo predefined step masks by @DefTruth in #525
- feat: support per_token_quant_fp8 triton kernel by @triple-Mu in #524
- feat: unified async ulysses fp8 by @DefTruth in #526
- feat: support serving for cache-dit by @BBuf in #522
- Fix get_model_info api 404 when serving with tp/cp by @BBuf in #529
- feat: support cache for hunyuanvideo-1.5 by @DefTruth in #528
- feat: support cache for ovis-image by @DefTruth in #530
New Contributors
- @triple-Mu made their first contribution in #509
Full Changelog: v1.1.4...v1.1.5
v1.1.4 🔥FLUX.2/Z-Image
What's Changed
- feat: support torch profiler in cache-dit by @BBuf in #491
- feat: support 🔥z-image tensor parallel by @gameofdimension in #494
- feat: support lumina2 tensor parallel by @gameofdimension in #495
- feat: support cache for 🔥z-image by @DefTruth in #496
- feat: support context parallel for 🔥z-image by @DefTruth in #497
- fix: temp FnB(n>0) workaround for z-image cache w/ cp by @DefTruth in #499
Full Changelog: v1.1.3...v1.1.4
v1.1.3 🔥FLUX.2
What's Changed
- chore: Add wan 2.2 i2v context parallel example by @DefTruth in #476
- chore: optimize wan examples, compile & offload by @BBuf in #477
- feat: support async ulysses cp for flux by @DefTruth in #480
- chore: update support matrix by @DefTruth in #484
- chore: update async ulysses cp docs by @DefTruth in #486
- chore: update async ulysses cp refs by @DefTruth in #487
- feat: support FLUX.2-dev Tensor Parallelism by @gameofdimension in #485
- feat: support Hybrid cache + TP for 🔥FLUX.2 by @DefTruth in #489
- feat: Add seq offload for 🔥FLUX.2 w/o parallel by @DefTruth in #490
- feat: support 🔥FLUX.2 context parallel by @DefTruth in #492
Full Changelog: v1.1.2...v1.1.3
v1.1.2 UAA & SkyReelsV2 TP/CP
What's Changed
- chore: Update README.md by @DefTruth in #455
- fix load options drop kwargs by @DefTruth in #456
- chore: add maybe pad prompt utils by @DefTruth in #458
- fix: move .to(device) to reduce tp mem by @BBuf in #459
- example: support more overridden args and memory tracker by @BBuf in #461
- Add missing model-path args in example by @BBuf in #463
- UAA: ulysses anything attn w/ zero overhead by @DefTruth in #462
- fix qwen-image multi-gpu mismatch by @BBuf in #464
- Fix more models multi gpu mismatch by @BBuf in #466
- feat: support unshard anything for UAA by @DefTruth in #465
- chore: update qwen-image example for UAA by @DefTruth in #468
- chore: Update README.md by @DefTruth in #470
- chore: Update README.md by @DefTruth in #471
- support skyreels cp and tp ulysses by @BBuf in #469
- always use vae tiling if vram <= 48 GiB for qwen-image by @DefTruth in #472
- chore: Add SkyReelsV2 tp/cp to support-matrix by @BBuf in #473
- fix: correct string literal syntax errors in examples by @BBuf in #475
- feat: allow UAA in compiled graph by @DefTruth in #474
Full Changelog: v1.1.1...v1.1.2
v1.1.1
What's Changed
- chore: Update README.md by @DefTruth in #442
- feat: support step compute mask by @DefTruth in #444
- bugfix: fix bench distill cfg mismatch by @DefTruth in #445
- chore: update step mask docs by @DefTruth in #446
- chore: Update User_Guide.md by @DefTruth in #447
- chore: update README by @DefTruth in #448
- chore: update step mask example by @DefTruth in #449
- chore: highlight SCM - step computation mask by @DefTruth in #450
- chore: highlight SCM - step computation mask by @DefTruth in #451
- chore: highlight SCM - step computation mask by @DefTruth in #452
- misc: support quantize and attn backend for flux example by @DefTruth in #453
- misc: add quant and attn backend -> step mask example by @DefTruth in #454
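Several entries above concern SCM (the step computation mask introduced in #444 and the pre-defined masks in #517/#523). As a rough illustration of the idea only (hypothetical helper names, not cache-dit's actual API), a step mask is a per-step boolean plan saying which denoising steps run the full transformer and which reuse cached results:

```python
# Hypothetical sketch of a step computation mask (SCM).
# Names are illustrative; this is not cache-dit's real API.
def plan_steps(compute_mask):
    """Map a per-step boolean mask to compute/cached actions."""
    return ["compute" if flag else "cached" for flag in compute_mask]

# An 8-step plan where only every other step runs the full transformer:
mask = [True, False, True, False, True, False, True, False]
print(plan_steps(mask))
```

A pre-defined mask like this trades a small quality loss for skipping roughly half of the transformer forward passes.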
Full Changelog: v1.1.0...v1.1.1
v1.1.0 🎉Context/Tensor Parallelism
🔥Highlight
We are excited to announce that 🎉v1.1.0 of cache-dit has finally been released! It brings 🔥Context Parallelism and 🔥Tensor Parallelism to cache-dit, making it a Unified and Flexible Inference Engine for 🤗DiTs. Key features: Unified Cache APIs, Forward Pattern Matching, Block Adapter, DBCache, DBPrune, Cache CFG, TaylorSeer, Context Parallelism, Tensor Parallelism, and 🎉SOTA performance.
⚙️Installation
You can install the stable release of cache-dit from PyPI:
pip3 install -U cache-dit  # or, pip3 install -U "cache-dit[all]" for all features

Or you can install the latest develop version from GitHub:

pip3 install git+https://github.com/vipshop/cache-dit.git

Please also install the latest main branch of diffusers for context parallelism:

pip3 install git+https://github.com/huggingface/diffusers.git

🔥Supported DiTs
Tip
One Model Series may contain many pipelines. cache-dit applies optimizations at the Transformer level; thus, any pipeline that includes a supported transformer is already supported by cache-dit. ✅: known to work and officially supported now; ✖️: not officially supported now, but may be supported in the future; Q: 4-bit models w/ nunchaku + SVDQ W4A4.
| 📚Model | Cache | CP | TP | 📚Model | Cache | CP | TP |
|---|---|---|---|---|---|---|---|
| 🎉FLUX.1 | ✅ | ✅ | ✅ | 🎉FLUX.1 Q | ✅ | ✅ | ✖️ |
| 🎉FLUX.1-Fill | ✅ | ✅ | ✅ | 🎉FLUX.1-Fill Q | ✅ | ✅ | ✖️ |
| 🎉Qwen-Image | ✅ | ✅ | ✅ | 🎉Qwen-Image Q | ✅ | ✅ | ✖️ |
| 🎉Qwen...Edit | ✅ | ✅ | ✅ | 🎉Qwen...Edit Q | ✅ | ✅ | ✖️ |
| 🎉Qwen...Lightning | ✅ | ✅ | ✅ | 🎉Qwen...Light Q | ✅ | ✅ | ✖️ |
| 🎉Qwen...Control.. | ✅ | ✅ | ✅ | 🎉Qwen...E...Light Q | ✅ | ✅ | ✖️ |
| 🎉Wan 2.1 I2V/T2V | ✅ | ✅ | ✅ | 🎉Mochi | ✅ | ✖️ | ✅ |
| 🎉Wan 2.1 VACE | ✅ | ✅ | ✅ | 🎉HiDream | ✅ | ✖️ | ✖️ |
| 🎉Wan 2.2 I2V/T2V | ✅ | ✅ | ✅ | 🎉HunyuanDiT | ✅ | ✖️ | ✅ |
| 🎉HunyuanVideo | ✅ | ✅ | ✅ | 🎉Sana | ✅ | ✖️ | ✖️ |
| 🎉ChronoEdit | ✅ | ✅ | ✅ | 🎉Bria | ✅ | ✖️ | ✖️ |
| 🎉CogVideoX | ✅ | ✅ | ✅ | 🎉SkyReelsV2 | ✅ | ✖️ | ✖️ |
| 🎉CogVideoX 1.5 | ✅ | ✅ | ✅ | 🎉Lumina 1/2 | ✅ | ✖️ | ✖️ |
| 🎉CogView4 | ✅ | ✅ | ✅ | 🎉DiT-XL | ✅ | ✅ | ✖️ |
| 🎉CogView3Plus | ✅ | ✅ | ✅ | 🎉Allegro | ✅ | ✖️ | ✖️ |
| 🎉PixArt Sigma | ✅ | ✅ | ✅ | 🎉Cosmos | ✅ | ✖️ | ✖️ |
| 🎉PixArt Alpha | ✅ | ✅ | ✅ | 🎉OmniGen | ✅ | ✖️ | ✖️ |
| 🎉Chroma-HD | ✅ | ✅ | ✅ | 🎉EasyAnimate | ✅ | ✖️ | ✖️ |
| 🎉VisualCloze | ✅ | ✅ | ✅ | 🎉StableDiffusion3 | ✅ | ✖️ | ✖️ |
| 🎉HunyuanImage | ✅ | ✅ | ✅ | 🎉PRX T2I | ✅ | ✖️ | ✖️ |
| 🎉Kandinsky5 | ✅ | ✅ | ✅ | 🎉Amused | ✅ | ✖️ | ✖️ |
| 🎉LTXVideo | ✅ | ✅ | ✅ | 🎉AuraFlow | ✅ | ✖️ | ✖️ |
| 🎉ConsisID | ✅ | ✅ | ✅ | 🎉LongCatVideo | ✅ | ✖️ | ✖️ |
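The Cache column above refers to cache-dit's DBCache-style step caching. As a toy illustration of the general idea behind such schemes (pure Python, hypothetical names; not cache-dit internals), a cache can decide per denoising step whether to reuse the previous step's result based on how much an early block's output changed:

```python
# Toy sketch of a residual-diff cache decision, as used by step-caching
# schemes in general. Hypothetical names; not cache-dit internals.
import random

def mean_abs(xs):
    return sum(abs(x) for x in xs) / len(xs)

def should_reuse(prev_out, curr_out, threshold=0.08):
    """Reuse the cached result when the relative L1 change is small."""
    diff = mean_abs([c - p for c, p in zip(curr_out, prev_out)])
    return diff / (mean_abs(prev_out) + 1e-8) < threshold

random.seed(0)
prev = [random.gauss(0, 1) for _ in range(64)]
similar = [p + 0.01 * random.gauss(0, 1) for p in prev]  # small drift
changed = [p + random.gauss(0, 1) for p in prev]         # large change

print(should_reuse(prev, similar))  # small drift -> reuse cache
print(should_reuse(prev, changed))  # large change -> recompute
```

When the early-block output barely moves between adjacent steps, the remaining blocks are skipped and the cached residual is reused; otherwise the full forward runs and refreshes the cache.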
⚡️Hybrid Context Parallelism
cache-dit is compatible with context parallelism. Currently, we support a Hybrid Cache + Context Parallelism scheme (via the NATIVE_DIFFUSER parallelism backend) in cache-dit. Users can apply Context Parallelism to further accelerate inference! For more details, please refer to 📚examples/parallelism. Currently, cache-dit supports context parallelism for FLUX.1, Qwen-Image, Qwen-Image-Lightning, LTXVideo, Wan 2.1, Wan 2.2, HunyuanImage-2.1, HunyuanVideo, CogVideoX 1.0, CogVideoX 1.5, CogView 3/4, VisualCloze, etc. cache-dit will support more models in the future.
# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig
cache_dit.enable_cache(
pipe_or_adapter,
cache_config=DBCacheConfig(...),
# Set ulysses_size > 1 to enable ulysses style context parallelism.
parallelism_config=ParallelismConfig(ulysses_size=2),
)
# torchrun --nproc_per_node=2 parallel_cache.py

⚡️Hybrid Tensor Parallelism
cache-dit is also compatible with tensor parallelism. Currently, we support a Hybrid Cache + Tensor Parallelism scheme (via the NATIVE_PYTORCH parallelism backend) in cache-dit. Users can apply Tensor Parallelism to further accelerate inference and reduce VRAM usage per GPU! For more details, please refer to 📚examples/parallelism. Now, cache-dit supports tensor parallelism for FLUX.1, Qwen-Image, Qwen-Image-Lightning, Wan 2.1, Wan 2.2, HunyuanImage-2.1, HunyuanVideo, VisualCloze, etc. cache-dit will support more models in the future.
# pip3 install "cache-dit[parallelism]"
import cache_dit
from cache_dit import DBCacheConfig, ParallelismConfig
cache_dit.enable_cache(
pipe_or_adapter,
cache_config=DBCacheConfig(...),
# Set tp_size > 1 to enable tensor parallelism.
parallelism_config=ParallelismConfig(tp_size=2),
)
# torchrun --nproc_per_node=2 parallel_cache.py

Important
Please note that, in the short term, we have no plans to support Hybrid Parallelism (combining Context and Tensor Parallelism). Please choose either Context Parallelism or Tensor Parallelism based on your actual scenario.
v1.0.16
What's Changed
- feat: support cogview3/4 cogvideox Tensor Parallelism by @gameofdimension in #419
- chore: remove un-needed pytest.ini by @DefTruth in #421
- feat: support pixart models Tensor Parallelism by @gameofdimension in #422
- feat: support chrono-edit context parallel by @DefTruth in #424
- chore: Update README.md by @DefTruth in #425
- feat: support Kandinsky5 context parallel by @DefTruth in #426
- feat: support LTX-Video Tensor Parallelism by @gameofdimension in #428
- chore: Update README.md by @DefTruth in #430
- feat: support ConsisID-preview Tensor Parallelism by @gameofdimension in #431
- bugfix: fix chrono-edit context parallel by @DefTruth in #432
- bugfix: fix chrono-edit context parallel by @DefTruth in #433
- chore: add speedup image by @DefTruth in #434
- chore: update speedup image by @DefTruth in #435
- chore: update speedup image by @DefTruth in #436
- chore: update clip-score bench by @DefTruth in #437
Full Changelog: v1.0.15...v1.0.16
v1.0.15
What's Changed
- feat: support cache & tp for wan vace by @DefTruth in #406
- feat: support mochi-1-preview Tensor Parallelism by @gameofdimension in #408
- chore: Update README.md by @DefTruth in #409
- feat: support HunyuanDiT Tensor Parallelism by @gameofdimension in #411
- bugfix: fix summary stats from dict by @DefTruth in #412
- bugfix: fix strify error while no-cache by @DefTruth in #414
- feat: support wan vace context parallel by @DefTruth in #415
- chore: Update README.md by @DefTruth in #416
- feat: support Wan2.1-VACE Tensor Parallelism by @gameofdimension in #417
- misc: use dummy blocks for flux by default by @DefTruth in #418
Full Changelog: v1.0.14...v1.0.15
