52 commits
1d8ac33  Add the Skip softmax diffusion (jingyu-ml, Apr 2, 2026)
1f8f0d3  Add test case (jingyu-ml, Apr 2, 2026)
5873652  Fixed error (jingyu-ml, Apr 2, 2026)
4c179a3  Fixed the test case (jingyu-ml, Apr 2, 2026)
2c323df  Merge branch 'main' into jingyux/diffusion-skip-softmax (jingyu-ml, Apr 2, 2026)
8702b7b  Removed the token import (jingyu-ml, Apr 6, 2026)
bbe2123  Merge branch 'main' into jingyux/diffusion-skip-softmax (jingyu-ml, Apr 6, 2026)
70099a5  removed the unused code (jingyu-ml, Apr 6, 2026)
6cc96a4  Update the README (jingyu-ml, Apr 6, 2026)
4de0d3b  Updated the example script (jingyu-ml, Apr 7, 2026)
b3d3d4d  Update the readme (jingyu-ml, Apr 7, 2026)
8dc6162  Update the calibration kernel (jingyu-ml, Apr 7, 2026)
8aa32cc  ADd the readme (jingyu-ml, Apr 7, 2026)
fbeabcf  Update the example script (jingyu-ml, Apr 7, 2026)
adcd480  Add vLLM integration for modelopt sparse attention (kaix-nv, Mar 23, 2026)
20d32d1  Replace impl to support decode (kaix-nv, Mar 28, 2026)
0186f23  Remove the monkey-patch code and unify impl replacement (kaix-nv, Mar 31, 2026)
c98fe45  Fix per-layer sparse config and sliding_window (kaix-nv, Mar 31, 2026)
849609e  Add Wan2.2 SageAttention example for diffusion inference (yeyu-nvidia, Mar 23, 2026)
fbf5b91  Add sageattn call counters for diagnostics (yeyu-nvidia, Mar 23, 2026)
1760667  Add SA2 kernel variants (sage2-fp16, sage2-fp8) and full benchmark (yeyu-nvidia, Mar 23, 2026)
9b19369  Handle missing SA2 kernels gracefully with version detection (yeyu-nvidia, Mar 23, 2026)
e202a41  Add GPU SM compatibility check and graceful kernel fallback (yeyu-nvidia, Mar 23, 2026)
6dd9450  Add FP8 E4M3 attention and accuracy metrics to wan2 sage attention ex… (yeyu-nvidia, Mar 25, 2026)
cf6bd44  Set 5B model as default in wan2 sage attention example (yeyu-nvidia, Mar 30, 2026)
f7e3922  Add CLIP score to --compare mode for semantic quality evaluation (yeyu-nvidia, Mar 30, 2026)
bb1bfe6  Fix CLIP scoring: add --clip-model flag, HF_TOKEN support, error hand… (yeyu-nvidia, Mar 30, 2026)
356395f  Add NVFP4 E2M1 attention kernel to wan2 sage attention example (yeyu-nvidia, Mar 30, 2026)
d0263a2  Fix CLIP score range comment and update output message (yeyu-nvidia, Mar 30, 2026)
a3d0b40  Fix nvfp4: quantize Q/K only, keep V in original precision (yeyu-nvidia, Mar 30, 2026)
63d6168  Fix nvfp4: quantize V to FP8 E4M3 instead of original precision (yeyu-nvidia, Mar 30, 2026)
9606b41  Fix nvfp4: keep V in BF16 to preserve visual quality (yeyu-nvidia, Mar 30, 2026)
0a6e783  Fix nvfp4: quantize post-softmax P instead of Q/K (SA3-faithful) (yeyu-nvidia, Mar 30, 2026)
56aae7f  Replace nvfp4 kernel with int4: 16-level P quantization for diffusion (yeyu-nvidia, Mar 30, 2026)
4b3db2d  Fix nvfp4: use per-tile NVFP4 quantization of P (64x64 tiles) (yeyu-nvidia, Mar 31, 2026)
f6a6ae5  Fix nvfp4 OOM: chunked row processing to avoid full NxN attention matrix (yeyu-nvidia, Apr 1, 2026)
dd44dc3  Add --nvfp4-tile: configurable per-tile NVFP4 P quantization granularity (yeyu-nvidia, Apr 1, 2026)
da3781a  Add diffusers_triton backend for WAN2.2 sparse attention (yeyu-nvidia, Apr 3, 2026)
9ef92f1  Add unit tests for diffusers WAN sparse attention plugin (yeyu-nvidia, Apr 3, 2026)
0186bd4  Fix TestForwardShape: use m.disable() instead of proc._enabled=False (yeyu-nvidia, Apr 3, 2026)
547771b  Add NVFP4 P-matrix attention quantization for WAN2.2 via mtq.quantize() (yeyu-nvidia, Apr 3, 2026)
15d6545  Fix OOM in _QuantWanAttnProcessor: chunk attention rows to avoid N×N … (yeyu-nvidia, Apr 3, 2026)
2619db3  Add NVFP4 P-matrix quantization to Triton flash-attention kernel (yeyu-nvidia, Apr 3, 2026)
a544476  Fix WanSparseAttentionModule.forward: call processor directly (yeyu-nvidia, Apr 3, 2026)
5a40699  Fix WanAttention sparse registration order: diffusers plugin before H… (yeyu-nvidia, Apr 7, 2026)
111c4b2  Lower default skip-softmax threshold and add --skip-threshold CLI arg (yeyu-nvidia, Apr 3, 2026)
70a9297  Fix skip-softmax threshold formula: remove erroneous * sm_scale factor (yeyu-nvidia, Apr 7, 2026)
9924628  Address PR review comments (yeyu-nvidia, Apr 7, 2026)
3ed4ba8  Address remaining PR review comments in triton_fa.py (yeyu-nvidia, Apr 7, 2026)
3f0bfd3  Revert skip-softmax threshold formula change: restore * sm_scale (yeyu-nvidia, Apr 8, 2026)
68f63b6  Build SageAttention as standalone quantization feature (yeyu-nvidia, Apr 8, 2026)
356c517  Add nvfp4 kernel choice for standalone SageAttention in example script (yeyu-nvidia, Apr 8, 2026)
62 changes: 62 additions & 0 deletions examples/diffusers/README.md
@@ -13,6 +13,7 @@ Cache Diffusion is a technique that reuses cached outputs from previous diffusio
| Pre-Requisites | Required & optional packages to use this technique | \[[Link](#pre-requisites)\] | |
| Getting Started | Learn how to optimize your models using quantization/cache diffusion to reduce precision and improve inference efficiency | \[[Link](#getting-started)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Support Matrix | View the support matrix to see quantization/cache diffusion compatibility and feature availability across different models | \[[Link](#support-matrix)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Sparse Attention (Skip-Softmax) | Skip-softmax sparse attention for diffusion models | \[[Link](#sparse-attention-skip-softmax)\] | |
| Cache Diffusion | Caching technique to accelerate inference without compromising quality | \[[Link](#cache-diffusion)\] | |
| Post Training Quantization (PTQ) | Example scripts on how to run PTQ on diffusion models | \[[Link](#post-training-quantization-ptq)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
| Quantization Aware Training (QAT) | Example scripts on how to run QAT on diffusion models | \[[Link](#quantization-aware-training-qat)\] | \[[docs](https://nvidia.github.io/Model-Optimizer/guides/1_quantization.html)\] |
@@ -276,6 +277,67 @@ mto.restore(pipe.unet, your_quantized_ckpt)

By following these steps, your PEFT LoRA model should be efficiently quantized using ModelOpt, ready for deployment while maximizing performance.

## Sparse Attention (Skip-Softmax)

Skip-softmax sparse attention skips KV tiles whose attention scores are negligible during the softmax computation, reducing FLOPs without retraining. An exponential model (`scale_factor = a * exp(b * target_sparsity)`) is calibrated once, then the target sparsity can be adjusted at runtime without recalibration.
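The calibration step fits the two constants `a` and `b`; after that, switching to a new target sparsity is just an evaluation of the formula. A minimal sketch of how such an exponential model can be fit by log-linear least squares (the helper names and synthetic calibration points below are illustrative, not ModelOpt APIs):

```python
import math

def fit_exponential_model(sparsities, scale_factors):
    """Fit scale = a * exp(b * s) by linear regression on log(scale)."""
    n = len(sparsities)
    ys = [math.log(v) for v in scale_factors]
    mx = sum(sparsities) / n
    my = sum(ys) / n
    b = sum((x - mx) * (y - my) for x, y in zip(sparsities, ys)) / sum(
        (x - mx) ** 2 for x in sparsities
    )
    a = math.exp(my - b * mx)
    return a, b

def scale_factor(a, b, target_sparsity):
    # Runtime evaluation: no recalibration needed for a new target.
    return a * math.exp(b * target_sparsity)

# Synthetic calibration points generated from a=0.01, b=4.0
points = [(s, 0.01 * math.exp(4.0 * s)) for s in (0.2, 0.4, 0.6, 0.8)]
a, b = fit_exponential_model([s for s, _ in points], [v for _, v in points])
```

Because the model is calibrated once per transformer, changing `target_sparsity` at inference time only re-evaluates `scale_factor`, which is what makes runtime adjustment cheap.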

### Getting Started

```python
import modelopt.torch.sparsity.attention_sparsity as mtsa

# 1. Define config with calibration
config = {
    "sparse_cfg": {
        "calibration": {
            "target_sparse_ratio": {"prefill": 0.5},
            "threshold_trials": [
                1e-6, 5e-6, 1e-5, 5e-5, 1e-4, 5e-4, 1e-3, 5e-3,
                1e-2, 2e-2, 5e-2, 1e-1, 2e-1, 3e-1, 5e-1, 7e-1,
                8e-1, 9e-1, 9.9e-1,
            ],
        },
        "*.attn1": {
            "method": "triton_skip_softmax",
            "backend": "triton",
            "is_causal": False,
            "collect_stats": True,
            "enable": True,
        },
        "*.attn2": {"enable": False},
        "default": {"enable": False},
    },
}

# 2. Provide a calibration forward loop
def forward_loop(model):
    pipeline(prompt="a cat", num_frames=81, num_inference_steps=40, ...)

# 3. Sparsify + calibrate
mtsa.sparsify(transformer, config, forward_loop=forward_loop)

# 4. Generate as usual — sparsity is applied automatically
output = pipeline(prompt="a dog on the beach", ...)
```

### Example Scripts

#### Wan 2.2 [Script](./sparsity/wan22_skip_softmax.py)

For the 14B model, the script automatically sparsifies both `transformer` and `transformer_2`.

```bash
# 5B model — calibrate + generate (4 prompts from OpenVid-1M, 151 frames, 40 steps)
python sparsity/wan22_skip_softmax.py \
--model-path Wan-AI/Wan2.2-TI2V-5B-Diffusers \
--calibrate --target-sparsity 0.5 --calib-size 4 \
--prompt "A sunset over mountains" --output out.mp4

# 14B model (both transformers sparsified)
python sparsity/wan22_skip_softmax.py \
--model-path Wan-AI/Wan2.2-T2V-A14B-Diffusers \
--calibrate --target-sparsity 0.5 --calib-size 4 \
--prompt "A sunset over mountains" --output out.mp4
```

## Cache Diffusion

Cache Diffusion methods, such as [DeepCache](https://arxiv.org/abs/2312.00858), [Block Caching](https://arxiv.org/abs/2312.03209) and [T-Gate](https://arxiv.org/abs/2404.02747), optimize performance by reusing cached outputs from previous steps instead of recalculating them. This **training-free** caching approach is compatible with a variety of models, like **DiT** and **UNet**, enabling considerable acceleration without compromising quality.
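The reuse pattern these methods share can be sketched in a few lines of plain Python (illustrative only; `CachedBlock` and `refresh_every` are hypothetical names, not a ModelOpt API):

```python
class CachedBlock:
    """Wrap a callable and reuse its cached output on non-refresh steps.

    A sketch of the caching idea behind DeepCache/Block Caching, not the
    actual library implementation.
    """

    def __init__(self, block, refresh_every=2):
        self.block = block              # the expensive sub-network to cache
        self.refresh_every = refresh_every
        self.step = 0
        self._cache = None

    def __call__(self, x):
        # Recompute on refresh steps; otherwise return the cached output.
        if self._cache is None or self.step % self.refresh_every == 0:
            self._cache = self.block(x)
        self.step += 1
        return self._cache

calls = []
def expensive_block(x):
    calls.append(x)
    return x * 2

cached = CachedBlock(expensive_block, refresh_every=2)
outs = [cached(t) for t in (1, 2, 3, 4)]
```

Here `expensive_block` stands in for a transformer or UNet block; real implementations typically decide when to refresh from the step schedule or output deltas rather than a fixed period.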