@BBuf (Collaborator) commented on Nov 28, 2025

TP2

➜  parallelism torchrun --nproc_per_node=2 run_flux_tp.py --parallel tp --cache --track-memory --profile
W1128 15:30:50.596000 163016 torch/distributed/run.py:774] 
W1128 15:30:50.596000 163016 torch/distributed/run.py:774] *****************************************
W1128 15:30:50.596000 163016 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1128 15:30:50.596000 163016 torch/distributed/run.py:774] *****************************************
WARNING 11-28 15:30:56 [_attention_dispatch.py:56] Re-registering NATIVE attention backend to enable context parallelism. This is a temporary workaround and should be removed after the native attention backend supports context parallelism natively. Please check: https://github.com/huggingface/diffusers/pull/12563 for more details. Or, you can disable this behavior by setting the environment variable `CACHE_DIT_ENABLE_CUSTOM_CP_NATIVE_ATTN_DISPATCH=0`.
Namespace(cache=True, compile=False, fuse_lora=False, steps=None, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=None, width=None, quantize=False, quantize_type='float8_weight_only', parallel_type='tp', attn=None, perf=False, prompt=None, negative_prompt=None, model_path=None, track_memory=True, ulysses_anything=False, ulysses_async_qkv_proj=False, disable_compute_comm_overlap=False, profile=True, profile_name=None, profile_dir=None, profile_activities=['CPU', 'GPU'], profile_with_stack=True, profile_record_shapes=True)
WARNING 11-28 15:30:56 [_attention_dispatch.py:56] Re-registering NATIVE attention backend to enable context parallelism. This is a temporary workaround and should be removed after the native attention backend supports context parallelism natively. Please check: https://github.com/huggingface/diffusers/pull/12563 for more details. Or, you can disable this behavior by setting the environment variable `CACHE_DIT_ENABLE_CUSTOM_CP_NATIVE_ATTN_DISPATCH=0`.
Namespace(cache=True, compile=False, fuse_lora=False, steps=None, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=None, width=None, quantize=False, quantize_type='float8_weight_only', parallel_type='tp', attn=None, perf=False, prompt=None, negative_prompt=None, model_path=None, track_memory=True, ulysses_anything=False, ulysses_async_qkv_proj=False, disable_compute_comm_overlap=False, profile=True, profile_name=None, profile_dir=None, profile_activities=['CPU', 'GPU'], profile_with_stack=True, profile_record_shapes=True)
Loading pipeline components...:   0%|                                                                                                                                           | 0/7 [00:00<?, ?it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 42.69it/s]
Loading pipeline components...:  29%|█████████████████████████████████████▍                                                                                             | 2/7 [00:00<00:00,  5.05it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 71.87it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 10.62it/s]
INFO 11-28 15:30:57 [cache_adapter.py:49] FluxPipeline is officially supported by cache-dit. Use it's pre-defined BlockAdapter directly!
INFO 11-28 15:30:57 [functor_flux.py:61] Applied FluxPatchFunctor for FluxTransformer2DModel, Patch: False.
INFO 11-28 15:30:57 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:30:57 [block_adapters.py:494] Match Block Forward Pattern: ['FluxTransformerBlock', 'FluxSingleTransformerBlock'], ForwardPattern.Pattern_1
INFO 11-28 15:30:57 [block_adapters.py:494] IN:('hidden_states', 'encoder_hidden_states'), OUT:('encoder_hidden_states', 'hidden_states'))
INFO 11-28 15:30:57 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FluxPipeline.
INFO 11-28 15:30:57 [cache_adapter.py:307] Collected Context Config: DBCache_F8B0_W8I1M0MC0_R0.08, Calibrator Config: None
INFO 11-28 15:30:57 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140406580690528, context_manager: FluxPipeline_140405576308960.
INFO 11-28 15:30:57 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 40.69it/s]
Loading pipeline components...:  43%|████████████████████████████████████████████████████████▏                                                                          | 3/7 [00:00<00:00,  7.61it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 127.14it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 10.36it/s]
INFO 11-28 15:30:57 [cache_adapter.py:49] FluxPipeline is officially supported by cache-dit. Use it's pre-defined BlockAdapter directly!
INFO 11-28 15:30:57 [functor_flux.py:61] Applied FluxPatchFunctor for FluxTransformer2DModel, Patch: False.
INFO 11-28 15:30:57 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:30:57 [block_adapters.py:494] Match Block Forward Pattern: ['FluxSingleTransformerBlock', 'FluxTransformerBlock'], ForwardPattern.Pattern_1
INFO 11-28 15:30:57 [block_adapters.py:494] IN:('hidden_states', 'encoder_hidden_states'), OUT:('encoder_hidden_states', 'hidden_states'))
INFO 11-28 15:30:57 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FluxPipeline.
INFO 11-28 15:30:57 [cache_adapter.py:307] Collected Context Config: DBCache_F8B0_W8I1M0MC0_R0.08, Calibrator Config: None
INFO 11-28 15:30:57 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140060816224576, context_manager: FluxPipeline_140060818899744.
INFO 11-28 15:30:57 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:31:21 [tp_plan_flux.py:62] Also applied Tensor Parallelism to extra module T5EncoderModel, id:140406460070032
INFO 11-28 15:31:21 [parallel_interface.py:48] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2), transformer id:140406460328288
INFO 11-28 15:31:21 [tp_plan_flux.py:62] Also applied Tensor Parallelism to extra module T5EncoderModel, id:140060815667712
INFO 11-28 15:31:21 [parallel_interface.py:48] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2), transformer id:140060814944320
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 5/5 [00:01<00:00,  4.65it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  2.36it/s]
INFO 11-28 15:31:58 [utils.py:40] Peak GPU memory usage: 18.60 GB
Profiler traces saved to: /tmp/cache_dit_profiles/flux_tp_inference-rank0.trace.json.gz
INFO 11-28 15:31:58 [utils.py:40] Peak GPU memory usage: 18.60 GB
WARNING 11-28 15:31:58 [summary.py:275] Can't find Context Options for: FluxSingleTransformerBlock
WARNING 11-28 15:31:58 [summary.py:284] Can't find Parallelism Config for: FluxSingleTransformerBlock
WARNING 11-28 15:31:58 [summary.py:275] Can't find Context Options for: FluxTransformerBlock
WARNING 11-28 15:31:58 [summary.py:284] Can't find Parallelism Config for: FluxTransformerBlock

🤗Context Options: FluxTransformer2DModel

{'cache_config': DBCacheConfig(cache_type=<CacheType.DBCache: 'DBCache'>, Fn_compute_blocks=8, Bn_compute_blocks=0, residual_diff_threshold=0.08, max_accumulated_residual_diff_threshold=None, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, enable_separate_cfg=False, cfg_compute_first=False, cfg_diff_compute_separate=True, num_inference_steps=None, steps_computation_mask=None, steps_computation_policy='dynamic'), 'name': 'transformer_blocks_140060816224576'}

🤖Parallelism Config: FluxTransformer2DModel

ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2)
Time cost: 31.93s
Saving image to flux.C0_Q0_DBCache_F8B0_W8I1M0MC0_R0.08_T0O0_TP2.png
(image attached)
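For reference, the configuration summarized above maps roughly onto the sketch below. Only the config classes and field values are taken from the log output; the import paths and the `enable_cache` / `enable_parallelism` entry points are assumptions and may differ from what run_flux_tp.py actually calls.

```python
# Hypothetical sketch of the TP2 + DBCache setup reported above; intended to be
# launched under `torchrun --nproc_per_node=2`. Import paths and the enable_*
# entry points are assumptions; config fields mirror the summary log.
import torch
from diffusers import FluxPipeline

import cache_dit  # assumed package layout
from cache_dit import DBCacheConfig, ParallelismBackend, ParallelismConfig  # assumed

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-dev",  # assumed checkpoint; the run uses the script's default model_path
    dtype=torch.bfloat16,
).to("cuda")

# DBCache_F8B0_W8I1M0MC0_R0.08, as printed in the Context Options summary
cache_config = DBCacheConfig(
    Fn_compute_blocks=8,
    Bn_compute_blocks=0,
    residual_diff_threshold=0.08,
    max_warmup_steps=8,
    warmup_interval=1,
    max_cached_steps=-1,
    max_continuous_cached_steps=-1,
)

# ParallelismConfig(backend=NATIVE_PYTORCH, tp_size=2), as printed by parallel_interface.py
parallel_config = ParallelismConfig(
    backend=ParallelismBackend.NATIVE_PYTORCH,
    tp_size=2,
)

cache_dit.enable_cache(pipe, cache_config=cache_config)                  # assumed entry point
cache_dit.enable_parallelism(pipe.transformer, config=parallel_config)   # assumed entry point

image = pipe("a cat holding a sign that says hello world").images[0]
image.save("flux_tp2.png")
```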

Ulysses CP2

➜  parallelism torchrun --nproc_per_node=2 run_flux_cp.py --parallel ulysses --cache --track-memory --profile
W1128 15:36:16.247000 163789 torch/distributed/run.py:774] 
W1128 15:36:16.247000 163789 torch/distributed/run.py:774] *****************************************
W1128 15:36:16.247000 163789 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1128 15:36:16.247000 163789 torch/distributed/run.py:774] *****************************************
WARNING 11-28 15:36:22 [_attention_dispatch.py:56] Re-registering NATIVE attention backend to enable context parallelism. This is a temporary workaround and should be removed after the native attention backend supports context parallelism natively. Please check: https://github.com/huggingface/diffusers/pull/12563 for more details. Or, you can disable this behavior by setting the environment variable `CACHE_DIT_ENABLE_CUSTOM_CP_NATIVE_ATTN_DISPATCH=0`.
Namespace(cache=True, compile=False, fuse_lora=False, steps=None, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=None, width=None, quantize=False, quantize_type='float8_weight_only', parallel_type='ulysses', attn=None, perf=False, prompt=None, negative_prompt=None, model_path=None, track_memory=True, ulysses_anything=False, ulysses_async_qkv_proj=False, disable_compute_comm_overlap=False, profile=True, profile_name=None, profile_dir=None, profile_activities=['CPU', 'GPU'], profile_with_stack=True, profile_record_shapes=True)
WARNING 11-28 15:36:22 [_attention_dispatch.py:56] Re-registering NATIVE attention backend to enable context parallelism. This is a temporary workaround and should be removed after the native attention backend supports context parallelism natively. Please check: https://github.com/huggingface/diffusers/pull/12563 for more details. Or, you can disable this behavior by setting the environment variable `CACHE_DIT_ENABLE_CUSTOM_CP_NATIVE_ATTN_DISPATCH=0`.
Namespace(cache=True, compile=False, fuse_lora=False, steps=None, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=None, width=None, quantize=False, quantize_type='float8_weight_only', parallel_type='ulysses', attn=None, perf=False, prompt=None, negative_prompt=None, model_path=None, track_memory=True, ulysses_anything=False, ulysses_async_qkv_proj=False, disable_compute_comm_overlap=False, profile=True, profile_name=None, profile_dir=None, profile_activities=['CPU', 'GPU'], profile_with_stack=True, profile_record_shapes=True)
Loading pipeline components...:   0%|                                                                                                                                           | 0/7 [00:00<?, ?it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 129.49it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 74.64it/s]
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 75.80it/s]
Loading pipeline components...:  71%|█████████████████████████████████████████████████████████████████████████████████████████████▌                                     | 5/7 [00:00<00:00, 15.62it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading pipeline components...:  71%|█████████████████████████████████████████████████████████████████████████████████████████████▌                                     | 5/7 [00:00<00:00, 16.49it/s]You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 130.90it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 13.06it/s]
Loading pipeline components...: 100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 12.76it/s]
INFO 11-28 15:36:29 [cache_adapter.py:49] FluxPipeline is officially supported by cache-dit. Use it's pre-defined BlockAdapter directly!
INFO 11-28 15:36:29 [cache_adapter.py:49] FluxPipeline is officially supported by cache-dit. Use it's pre-defined BlockAdapter directly!
INFO 11-28 15:36:29 [functor_flux.py:61] Applied FluxPatchFunctor for FluxTransformer2DModel, Patch: False.
INFO 11-28 15:36:29 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:36:29 [functor_flux.py:61] Applied FluxPatchFunctor for FluxTransformer2DModel, Patch: False.
INFO 11-28 15:36:29 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:36:29 [block_adapters.py:494] Match Block Forward Pattern: ['FluxSingleTransformerBlock', 'FluxTransformerBlock'], ForwardPattern.Pattern_1
INFO 11-28 15:36:29 [block_adapters.py:494] IN:('hidden_states', 'encoder_hidden_states'), OUT:('encoder_hidden_states', 'hidden_states'))
INFO 11-28 15:36:29 [block_adapters.py:494] Match Block Forward Pattern: ['FluxTransformerBlock', 'FluxSingleTransformerBlock'], ForwardPattern.Pattern_1
INFO 11-28 15:36:29 [block_adapters.py:494] IN:('hidden_states', 'encoder_hidden_states'), OUT:('encoder_hidden_states', 'hidden_states'))
INFO 11-28 15:36:29 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FluxPipeline.
INFO 11-28 15:36:29 [cache_adapter.py:307] Collected Context Config: DBCache_F8B0_W8I1M0MC0_R0.08, Calibrator Config: None
INFO 11-28 15:36:29 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FluxPipeline.
INFO 11-28 15:36:29 [cache_adapter.py:307] Collected Context Config: DBCache_F8B0_W8I1M0MC0_R0.08, Calibrator Config: None
INFO 11-28 15:36:29 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140244512954112, context_manager: FluxPipeline_140244554264032.
INFO 11-28 15:36:29 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140640997026672, context_manager: FluxPipeline_140641038293072.
INFO 11-28 15:36:29 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-28 15:36:29 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
Attention backends are an experimental feature and the API may be subject to change.
Attention backends are an experimental feature and the API may be subject to change.
INFO 11-28 15:36:29 [__init__.py:71] Found attention_backend from config, set attention backend to: _native_cudnn
INFO 11-28 15:36:29 [__init__.py:71] Found attention_backend from config, set attention backend to: _native_cudnn
`enable_parallelism` is an experimental feature. The API may change in the future and breaking changes may be introduced at any time without warning.
`enable_parallelism` is an experimental feature. The API may change in the future and breaking changes may be introduced at any time without warning.
INFO 11-28 15:36:29 [parallel_interface.py:48] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_DIFFUSER, ulysses_size=2, ring_size=None, tp_size=None), transformer id:140640996831168
INFO 11-28 15:36:29 [parallel_interface.py:48] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_DIFFUSER, ulysses_size=2, ring_size=None, tp_size=None), transformer id:140244507909232
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:01<00:00,  1.52it/s]
100%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00,  5.90it/s]
INFO 11-28 15:36:43 [utils.py:40] Peak GPU memory usage: 33.88 GB
Profiler traces saved to: /tmp/cache_dit_profiles/flux_cp_inference-rank0.trace.json.gz
INFO 11-28 15:36:43 [utils.py:40] Peak GPU memory usage: 33.88 GB
WARNING 11-28 15:36:43 [summary.py:275] Can't find Context Options for: FluxSingleTransformerBlock
WARNING 11-28 15:36:43 [summary.py:284] Can't find Parallelism Config for: FluxSingleTransformerBlock
WARNING 11-28 15:36:43 [summary.py:275] Can't find Context Options for: FluxTransformerBlock
WARNING 11-28 15:36:43 [summary.py:284] Can't find Parallelism Config for: FluxTransformerBlock

🤗Context Options: FluxTransformer2DModel

{'cache_config': DBCacheConfig(cache_type=<CacheType.DBCache: 'DBCache'>, Fn_compute_blocks=8, Bn_compute_blocks=0, residual_diff_threshold=0.08, max_accumulated_residual_diff_threshold=None, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, enable_separate_cfg=False, cfg_compute_first=False, cfg_diff_compute_separate=True, num_inference_steps=None, steps_computation_mask=None, steps_computation_policy='dynamic'), 'name': 'transformer_blocks_140640997026672'}

🤖Parallelism Config: FluxTransformer2DModel

ParallelismConfig(backend=ParallelismBackend.NATIVE_DIFFUSER, ulysses_size=2, ring_size=None, tp_size=None)
Time cost: 9.16s
Saving image to flux.1024x1024.C0_Q0_DBCache_F8B0_W8I1M0MC0_R0.08_T0O0_Ulysses2.png
(image attached)
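Relative to the TP2 sketch above, the Ulysses run only swaps the parallelism configuration; the DBCache config stays the same. As before, the entry-point name is an assumption, while the field values come straight from the summary.

```python
# Hypothetical: same pipeline and DBCache config as the TP2 sketch, but with
# Ulysses (sequence) parallelism over 2 ranks on the native diffusers backend,
# matching ParallelismConfig(backend=NATIVE_DIFFUSER, ulysses_size=2) above.
parallel_config = ParallelismConfig(
    backend=ParallelismBackend.NATIVE_DIFFUSER,
    ulysses_size=2,
)
cache_dit.enable_parallelism(pipe.transformer, config=parallel_config)  # assumed entry point
```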

all2all:

(image attachment: 2e121e4a-4ed5-4272-a732-80d018b3548b)
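The attachment presumably shows the all-to-all communication kernels from the Ulysses run. If you want to poke at the saved trace directly (the profiler writes gzipped Chrome-trace JSON; the path is taken from the rank-0 log above), here is a small sketch; the event-name match is only a heuristic.

```python
# Hypothetical helper: count all-to-all events in the gzipped Chrome trace
# written by the profiler (path taken from the rank-0 log above).
import gzip
import json

with gzip.open("/tmp/cache_dit_profiles/flux_cp_inference-rank0.trace.json.gz", "rt") as f:
    trace = json.load(f)

# Heuristic name match; exact kernel/op names depend on the NCCL/PyTorch version.
a2a = [e for e in trace.get("traceEvents", []) if "all_to_all" in str(e.get("name", "")).lower()]
print(f"found {len(a2a)} all-to-all events")
```

The same trace can usually be loaded as-is into ui.perfetto.dev, which handles gzipped Chrome traces.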

@DefTruth self-requested a review on Nov 29, 2025, 06:38

@DefTruth (Member) left a comment:


LGTM

@DefTruth merged commit 29a8345 into vipshop:main on Nov 29, 2025.