@BBuf (Collaborator) commented on Nov 28, 2025

TP2

➜  parallelism torchrun --nproc_per_node=2 run_flux_tp.py --parallel tp --cache --track-memory --profile
W1128 15:30:50.596000 163016 torch/distributed/run.py:774] 
W1128 15:30:50.596000 163016 torch/distributed/run.py:774] *****************************************
W1128 15:30:50.596000 163016 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1128 15:30:50.596000 163016 torch/distributed/run.py:774] *****************************************
WARNING 11-28 15:30:56 [_attention_dispatch.py:56] Re-registering NATIVE attention backend to enable context parallelism. This is a temporary workaround and should be removed after the native attention backend supports context parallelism natively. Please check: https://github.com/huggingface/diffusers/pull/12563 for more details. Or, you can disable this behavior by setting the environment variable `CACHE_DIT_ENABLE_CUSTOM_CP_NATIVE_ATTN_DISPATCH=0`.
Namespace(cache=True, compile=False, fuse_lora=False, steps=None, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=None, width=None, quantize=False, quantize_type='float8_weight_only', parallel_type='tp', attn=None, perf=False, prompt=None, negative_prompt=None, model_path=None, track_memory=True, ulysses_anything=False, ulysses_async_qkv_proj=False, disable_compute_comm_overlap=False, profile=True, profile_name=None, profile_dir=None, profile_activities=['CPU', 'GPU'], profile_with_stack=True, profile_record_shapes=True)
WARNING 11-28 15:30:56 [_attention_dispatch.py:56] Re-registering NATIVE attention backend to enable context parallelism. This is a temporary workaround and should be removed after the native attention backend supports context parallelism natively. Please check: https://github.com/huggingface/diffusers/pull/12563 for more details. Or, you can disable this behavior by setting the environment variable `CACHE_DIT_ENABLE_CUSTOM_CP_NATIVE_ATTN_DISPATCH=0`.
Namespace(cache=True, compile=False, fuse_lora=False, steps=None, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=None, width=None, quantize=False, quantize_type='float8_weight_only', parallel_type='tp', attn=None, perf=False, prompt=None, negative_prompt=None, model_path=None, track_memory=True, ulysses_anything=False, ulysses_async_qkv_proj=False, disable_compute_comm_overlap=False, profile=True, profile_name=None, profile_dir=None, profile_activities=['CPU', 'GPU'], profile_with_stack=True, profile_record_shapes=True)
Loading pipeline components...:   0%|                                                                                                                                           | 0/7 [00:00<?, ?it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 42.69it/s]
Loading pipeline components...:  29%|█████████████████████████████████████▍                                                                                             | 2/7 [00:00<00:00,  5.05it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 71.87it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 10.62it/s]
INFO 11-28 15:30:57 [cache_adapter.py:49] FluxPipeline is officially supported by cache-dit. Use it's pre-defined BlockAdapter directly!
INFO 11-28 15:30:57 [functor_flux.py:61] Applied FluxPatchFunctor for FluxTransformer2DModel, Patch: False.
INFO 11-28 15:30:57 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:30:57 [block_adapters.py:494] Match Block Forward Pattern: ['FluxTransformerBlock', 'FluxSingleTransformerBlock'], ForwardPattern.Pattern_1
INFO 11-28 15:30:57 [block_adapters.py:494] IN:('hidden_states', 'encoder_hidden_states'), OUT:('encoder_hidden_states', 'hidden_states'))
INFO 11-28 15:30:57 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FluxPipeline.
INFO 11-28 15:30:57 [cache_adapter.py:307] Collected Context Config: DBCache_F8B0_W8I1M0MC0_R0.08, Calibrator Config: None
INFO 11-28 15:30:57 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140406580690528, context_manager: FluxPipeline_140405576308960.
INFO 11-28 15:30:57 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 40.69it/s]
Loading pipeline components...:  43%|████████████████████████████████████████████████████████▏                                                                          | 3/7 [00:00<00:00,  7.61it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 127.14it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 10.36it/s]
INFO 11-28 15:30:57 [cache_adapter.py:49] FluxPipeline is officially supported by cache-dit. Use it's pre-defined BlockAdapter directly!
INFO 11-28 15:30:57 [functor_flux.py:61] Applied FluxPatchFunctor for FluxTransformer2DModel, Patch: False.
INFO 11-28 15:30:57 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:30:57 [block_adapters.py:494] Match Block Forward Pattern: ['FluxSingleTransformerBlock', 'FluxTransformerBlock'], ForwardPattern.Pattern_1
INFO 11-28 15:30:57 [block_adapters.py:494] IN:('hidden_states', 'encoder_hidden_states'), OUT:('encoder_hidden_states', 'hidden_states'))
INFO 11-28 15:30:57 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FluxPipeline.
INFO 11-28 15:30:57 [cache_adapter.py:307] Collected Context Config: DBCache_F8B0_W8I1M0MC0_R0.08, Calibrator Config: None
INFO 11-28 15:30:57 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140060816224576, context_manager: FluxPipeline_140060818899744.
INFO 11-28 15:30:57 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:31:21 [tp_plan_flux.py:62] Also applied Tensor Parallelism to extra module T5EncoderModel, id:140406460070032
INFO 11-28 15:31:21 [parallel_interface.py:48] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2), transformer id:140406460328288
INFO 11-28 15:31:21 [tp_plan_flux.py:62] Also applied Tensor Parallelism to extra module T5EncoderModel, id:140060815667712
INFO 11-28 15:31:21 [parallel_interface.py:48] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2), transformer id:140060814944320
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.65it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.36it/s]
INFO 11-28 15:31:58 [utils.py:40] Peak GPU memory usage: 18.60 GB
Profiler traces saved to: /tmp/cache_dit_profiles/flux_tp_inference-rank0.trace.json.gz
INFO 11-28 15:31:58 [utils.py:40] Peak GPU memory usage: 18.60 GB
WARNING 11-28 15:31:58 [summary.py:275] Can't find Context Options for: FluxSingleTransformerBlock
WARNING 11-28 15:31:58 [summary.py:284] Can't find Parallelism Config for: FluxSingleTransformerBlock
WARNING 11-28 15:31:58 [summary.py:275] Can't find Context Options for: FluxTransformerBlock
WARNING 11-28 15:31:58 [summary.py:284] Can't find Parallelism Config for: FluxTransformerBlock

🤗Context Options: FluxTransformer2DModel

{'cache_config': DBCacheConfig(cache_type=<CacheType.DBCache: 'DBCache'>, Fn_compute_blocks=8, Bn_compute_blocks=0, residual_diff_threshold=0.08, max_accumulated_residual_diff_threshold=None, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, enable_separate_cfg=False, cfg_compute_first=False, cfg_diff_compute_separate=True, num_inference_steps=None, steps_computation_mask=None, steps_computation_policy='dynamic'), 'name': 'transformer_blocks_140060816224576'}

🤖Parallelism Config: FluxTransformer2DModel

ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2)
Time cost: 31.93s
Saving image to flux.C0_Q0_DBCache_F8B0_W8I1M0MC0_R0.08_T0O0_TP2.png
(image attached)
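For reference, the configuration summarized above maps roughly onto the sketch below. Only the config classes and field values are taken from the log output; the import paths and the `enable_cache` / `enable_parallelism` entry points are assumptions and may differ from what run_flux_tp.py actually calls.

```python
# Hypothetical sketch of the TP2 + DBCache setup reported above; intended to be
# launched under `torchrun --nproc_per_node=2`. Import paths and the enable_*
# entry points are assumptions; config fields mirror the summary log.
import torch
from diffusers import FluxPipeline

import cache_dit  # assumed package layout
from cache_dit import DBCacheConfig, ParallelismBackend, ParallelismConfig  # assumed

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed checkpoint; the run uses the script's default model_path
    dtype=torch.bfloat16,
).to("cuda")

# DBCache_F8B0_W8I1M0MC0_R0.08, as printed in the Context Options summary
cache_config = DBCacheConfig(
    Fn_compute_blocks=8,
    Bn_compute_blocks=0,
    residual_diff_threshold=0.08,
    max_warmup_steps=8,
    warmup_interval=1,
    max_cached_steps=-1,
    max_continuous_cached_steps=-1,
)

# ParallelismConfig(backend=NATIVE_PYTORCH, tp_size=2), as printed by parallel_interface.py
parallel_config = ParallelismConfig(
    backend=ParallelismBackend.NATIVE_PYTORCH,
    tp_size=2,
)

cache_dit.enable_cache(pipe, cache_config=cache_config)                  # assumed entry point
cache_dit.enable_parallelism(pipe.transformer, config=parallel_config)   # assumed entry point

image = pipe("a cat holding a sign that says hello world").images[0]
image.save("flux_tp2.png")
```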

Ulysses CP2

➜  parallelism torchrun --nproc_per_node=2 run_flux_cp.py --parallel ulysses --cache --track-memory --profile
W1128 15:36:16.247000 163789 torch/distributed/run.py:774] 
W1128 15:36:16.247000 163789 torch/distributed/run.py:774] *****************************************
W1128 15:36:16.247000 163789 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1128 15:36:16.247000 163789 torch/distributed/run.py:774] *****************************************
WARNING 11-28 15:36:22 [_attention_dispatch.py:56] Re-registering NATIVE attention backend to enable context parallelism. This is a temporary workaround and should be removed after the native attention backend supports context parallelism natively. Please check: https://github.com/huggingface/diffusers/pull/12563 for more details. Or, you can disable this behavior by setting the environment variable `CACHE_DIT_ENABLE_CUSTOM_CP_NATIVE_ATTN_DISPATCH=0`.
Namespace(cache=True, compile=False, fuse_lora=False, steps=None, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=None, width=None, quantize=False, quantize_type='float8_weight_only', parallel_type='ulysses', attn=None, perf=False, prompt=None, negative_prompt=None, model_path=None, track_memory=True, ulysses_anything=False, ulysses_async_qkv_proj=False, disable_compute_comm_overlap=False, profile=True, profile_name=None, profile_dir=None, profile_activities=['CPU', 'GPU'], profile_with_stack=True, profile_record_shapes=True)
WARNING 11-28 15:36:22 [_attention_dispatch.py:56] Re-registering NATIVE attention backend to enable context parallelism. This is a temporary workaround and should be removed after the native attention backend supports context parallelism natively. Please check: https://github.com/huggingface/diffusers/pull/12563 for more details. Or, you can disable this behavior by setting the environment variable `CACHE_DIT_ENABLE_CUSTOM_CP_NATIVE_ATTN_DISPATCH=0`.
Namespace(cache=True, compile=False, fuse_lora=False, steps=None, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=None, width=None, quantize=False, quantize_type='float8_weight_only', parallel_type='ulysses', attn=None, perf=False, prompt=None, negative_prompt=None, model_path=None, track_memory=True, ulysses_anything=False, ulysses_async_qkv_proj=False, disable_compute_comm_overlap=False, profile=True, profile_name=None, profile_dir=None, profile_activities=['CPU', 'GPU'], profile_with_stack=True, profile_record_shapes=True)
Loading pipeline components...:   0%|                                                                                                                                           | 0/7 [00:00<?, ?it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 129.49it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 74.64it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 75.80it/s]
Loading pipeline components...:  71%|█████████████████████████████████████████████████████████████████████████████████████████████▌                                     | 5/7 [00:00<00:00, 15.62it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading pipeline components...:  71%|█████████████████████████████████████████████████████████████████████████████████████████████▌                                     | 5/7 [00:00<00:00, 16.49it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 130.90it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 13.06it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 12.76it/s]
INFO 11-28 15:36:29 [cache_adapter.py:49] FluxPipeline is officially supported by cache-dit. Use it's pre-defined BlockAdapter directly!
INFO 11-28 15:36:29 [cache_adapter.py:49] FluxPipeline is officially supported by cache-dit. Use it's pre-defined BlockAdapter directly!
INFO 11-28 15:36:29 [functor_flux.py:61] Applied FluxPatchFunctor for FluxTransformer2DModel, Patch: False.
INFO 11-28 15:36:29 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:36:29 [functor_flux.py:61] Applied FluxPatchFunctor for FluxTransformer2DModel, Patch: False.
INFO 11-28 15:36:29 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:36:29 [block_adapters.py:494] Match Block Forward Pattern: ['FluxSingleTransformerBlock', 'FluxTransformerBlock'], ForwardPattern.Pattern_1
INFO 11-28 15:36:29 [block_adapters.py:494] IN:('hidden_states', 'encoder_hidden_states'), OUT:('encoder_hidden_states', 'hidden_states'))
INFO 11-28 15:36:29 [block_adapters.py:494] Match Block Forward Pattern: ['FluxTransformerBlock', 'FluxSingleTransformerBlock'], ForwardPattern.Pattern_1
INFO 11-28 15:36:29 [block_adapters.py:494] IN:('hidden_states', 'encoder_hidden_states'), OUT:('encoder_hidden_states', 'hidden_states'))
INFO 11-28 15:36:29 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FluxPipeline.
INFO 11-28 15:36:29 [cache_adapter.py:307] Collected Context Config: DBCache_F8B0_W8I1M0MC0_R0.08, Calibrator Config: None
INFO 11-28 15:36:29 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FluxPipeline.
INFO 11-28 15:36:29 [cache_adapter.py:307] Collected Context Config: DBCache_F8B0_W8I1M0MC0_R0.08, Calibrator Config: None
INFO 11-28 15:36:29 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140244512954112, context_manager: FluxPipeline_140244554264032.
INFO 11-28 15:36:29 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140640997026672, context_manager: FluxPipeline_140641038293072.
INFO 11-28 15:36:29 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:36:29 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
Attention backends are an experimental feature and the API may be subject to change.
Attention backends are an experimental feature and the API may be subject to change.
INFO 11-28 15:36:29 [__init__.py:71] Found attention_backend from config, set attention backend to: _native_cudnn
INFO 11-28 15:36:29 [__init__.py:71] Found attention_backend from config, set attention backend to: _native_cudnn
`enable_parallelism` is an experimental feature. The API may change in the future and breaking changes may be introduced at any time without warning.
`enable_parallelism` is an experimental feature. The API may change in the future and breaking changes may be introduced at any time without warning.
INFO 11-28 15:36:29 [parallel_interface.py:48] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_DIFFUSER, ulysses_size=2, ring_size=None, tp_size=None), transformer id:140640996831168
INFO 11-28 15:36:29 [parallel_interface.py:48] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_DIFFUSER, ulysses_size=2, ring_size=None, tp_size=None), transformer id:140244507909232
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.52it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.90it/s]
INFO 11-28 15:36:43 [utils.py:40] Peak GPU memory usage: 33.88 GB
Profiler traces saved to: /tmp/cache_dit_profiles/flux_cp_inference-rank0.trace.json.gz
INFO 11-28 15:36:43 [utils.py:40] Peak GPU memory usage: 33.88 GB
WARNING 11-28 15:36:43 [summary.py:275] Can't find Context Options for: FluxSingleTransformerBlock
WARNING 11-28 15:36:43 [summary.py:284] Can't find Parallelism Config for: FluxSingleTransformerBlock
WARNING 11-28 15:36:43 [summary.py:275] Can't find Context Options for: FluxTransformerBlock
WARNING 11-28 15:36:43 [summary.py:284] Can't find Parallelism Config for: FluxTransformerBlock

🤗Context Options: FluxTransformer2DModel

{'cache_config': DBCacheConfig(cache_type=<CacheType.DBCache: 'DBCache'>, Fn_compute_blocks=8, Bn_compute_blocks=0, residual_diff_threshold=0.08, max_accumulated_residual_diff_threshold=None, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, enable_separate_cfg=False, cfg_compute_first=False, cfg_diff_compute_separate=True, num_inference_steps=None, steps_computation_mask=None, steps_computation_policy='dynamic'), 'name': 'transformer_blocks_140640997026672'}

🤖Parallelism Config: FluxTransformer2DModel

ParallelismConfig(backend=ParallelismBackend.NATIVE_DIFFUSER, ulysses_size=2, ring_size=None, tp_size=None)
Time cost: 9.16s
Saving image to flux.1024x1024.C0_Q0_DBCache_F8B0_W8I1M0MC0_R0.08_T0O0_Ulysses2.png
(image attached)
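Relative to the TP2 sketch above, the Ulysses run only swaps the parallelism configuration; the DBCache config stays the same. As before, the entry-point name is an assumption, while the field values come straight from the summary.

```python
# Hypothetical: same pipeline and DBCache config as the TP2 sketch, but with
# Ulysses (sequence) parallelism over 2 ranks on the native diffusers backend,
# matching ParallelismConfig(backend=NATIVE_DIFFUSER, ulysses_size=2) above.
parallel_config = ParallelismConfig(
    backend=ParallelismBackend.NATIVE_DIFFUSER,
    ulysses_size=2,
)
cache_dit.enable_parallelism(pipe.transformer, config=parallel_config)  # assumed entry point
```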

all2all:

(image attachment: 2e121e4a-4ed5-4272-a732-80d018b3548b)
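The attachment presumably shows the all-to-all communication kernels from the Ulysses run. If you want to poke at the saved trace directly (the profiler writes gzipped Chrome-trace JSON; the path is taken from the rank-0 log above), here is a small sketch; the event-name match is only a heuristic.

```python
# Hypothetical helper: count all-to-all events in the gzipped Chrome trace
# written by the profiler (path taken from the rank-0 log above).
import gzip
import json

with gzip.open("/tmp/cache_dit_profiles/flux_cp_inference-rank0.trace.json.gz", "rt") as f:
    trace = json.load(f)

# Heuristic name match; exact kernel/op names depend on the NCCL/PyTorch version.
a2a = [e for e in trace.get("traceEvents", []) if "all_to_all" in str(e.get("name", "")).lower()]
print(f"found {len(a2a)} all-to-all events")
```

The same trace can usually be loaded as-is into ui.perfetto.dev, which handles gzipped Chrome traces.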

@DefTruth self-requested a review on Nov 29, 2025, 06:38

@DefTruth (Member) left a comment:


LGTM

@DefTruth merged commit 29a8345 into vipshop:main on Nov 29, 2025.