
Conversation

@BBuf BBuf commented Nov 20, 2025

This PR adds flexible command-line arguments and GPU memory tracking capabilities to all example scripts, making them more convenient for testing and benchmarking.

  1. Customizable Prompts

    • --prompt: Override the default prompt in examples
    • --negative-prompt: Override the default negative prompt in examples
  2. Flexible Model Path

    • --model-path: Override the model loading path, useful for testing different model versions or local checkpoints
  3. Memory Tracking

    • --track-memory: Track and report peak GPU memory usage during inference
    • Implemented via a new MemoryTracker context manager class in utils.py (see the sketch below)
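
These options are plain argparse flags; a rough sketch of how they might be wired up (hypothetical wiring, but the option names match the Namespace dumps in the runs below):

```python
# Sketch of the new flags (assumed wiring; the shared parser in the
# example scripts may structure this differently).
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--prompt", type=str, default=None,
                    help="Override the example's default prompt")
parser.add_argument("--negative-prompt", type=str, default=None,
                    help="Override the example's default negative prompt")
parser.add_argument("--model-path", type=str, default=None,
                    help="Load from a local checkpoint or a different model version")
parser.add_argument("--track-memory", action="store_true",
                    help="Track and report peak GPU memory usage during inference")
args = parser.parse_args()
```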

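The --track-memory flag is backed by the new MemoryTracker context manager in utils.py. A minimal sketch of such a class, assuming it wraps torch.cuda's peak-memory-stats APIs (the actual implementation may differ):

```python
# Minimal sketch of a MemoryTracker context manager (assumed shape;
# the real class in utils.py may differ).
import torch

class MemoryTracker:
    def __init__(self, device=None):
        self.device = device if device is not None else torch.cuda.current_device()
        self.peak_gb = 0.0

    def __enter__(self):
        # Reset stats so the peak reflects inference, not model loading.
        torch.cuda.reset_peak_memory_stats(self.device)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        torch.cuda.synchronize(self.device)
        self.peak_gb = torch.cuda.max_memory_allocated(self.device) / 1024**3
        print(f"Peak GPU memory usage: {self.peak_gb:.2f} GB")
        return False  # never swallow exceptions
```

Wrapped around the pipeline call (e.g. `with MemoryTracker(): image = pipe(prompt).images[0]`), this produces the "Peak GPU memory usage" lines in the runs below.
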
How to use?

Flux.1.dev on a single H100, with a custom model path, custom prompt, and memory tracking

python3 run_flux.py --height 720 --width 720 --steps 50 --cache --compile --prompt "A curious raccoon" --model-path /home/lmsys/bbuf/FLUX___1-dev --track-memory
WARNING 11-20 10:21:50 [pattern_base.py:25] Context parallelism requires the 'diffusers>=0.36.dev0'.Please install latest version of diffusers from source: 
WARNING 11-20 10:21:50 [pattern_base.py:25] pip3 install git+https://github.com/huggingface/diffusers.git
Namespace(cache=True, compile=True, fuse_lora=False, steps=50, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=720, width=720, quantize=False, quantize_type='float8_weight_only', parallel_type=None, attn=None, perf=False, prompt='A curious raccoon', negative_prompt=None, model_path='/home/lmsys/bbuf/FLUX___1-dev', track_memory=True)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 60.12it/s]
Loading pipeline components...:  14%|█████████▍                                                        | 1/7 [00:00<00:01,  5.73it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 138.02it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers                       | 0/2 [00:00<?, ?it/s]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 14.90it/s]
INFO 11-20 10:21:50 [cache_adapter.py:49] FluxPipeline is officially supported by cache-dit. Use it's pre-defined BlockAdapter directly!
INFO 11-20 10:21:50 [functor_flux.py:61] Applied FluxPatchFunctor for FluxTransformer2DModel, Patch: False.
INFO 11-20 10:21:50 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-20 10:21:50 [block_adapters.py:494] Match Block Forward Pattern: ['FluxTransformerBlock', 'FluxSingleTransformerBlock'], ForwardPattern.Pattern_1
INFO 11-20 10:21:50 [block_adapters.py:494] IN:('hidden_states', 'encoder_hidden_states'), OUT:('encoder_hidden_states', 'hidden_states'))
INFO 11-20 10:21:50 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FluxPipeline.
INFO 11-20 10:21:50 [cache_adapter.py:307] Collected Context Config: DBCache_F8B0_W8I1M0MC0_R0.08, Calibrator Config: None
INFO 11-20 10:21:50 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140396272493024, context_manager: FluxPipeline_140396197738592.
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:05<00:00,  9.27it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:02<00:00, 18.63it/s]
INFO 11-20 10:22:20 [utils.py:40] Peak GPU memory usage: 32.68 GB
WARNING 11-20 10:22:20 [summary.py:275] Can't find Context Options for: FluxSingleTransformerBlock
WARNING 11-20 10:22:20 [summary.py:284] Can't find Parallelism Config for: FluxSingleTransformerBlock
WARNING 11-20 10:22:20 [summary.py:275] Can't find Context Options for: FluxTransformerBlock
WARNING 11-20 10:22:20 [summary.py:284] Can't find Parallelism Config for: FluxTransformerBlock

🤗Context Options: OptimizedModule

{'cache_config': DBCacheConfig(cache_type=<CacheType.DBCache: 'DBCache'>, Fn_compute_blocks=8, Bn_compute_blocks=0, residual_diff_threshold=0.08, max_accumulated_residual_diff_threshold=None, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, enable_separate_cfg=False, cfg_compute_first=False, cfg_diff_compute_separate=True, num_inference_steps=None, steps_computation_mask=None, steps_computation_policy='dynamic'), 'name': 'transformer_blocks_140396272493024'}
WARNING 11-20 10:22:20 [summary.py:284] Can't find Parallelism Config for: OptimizedModule

⚡️Cache Steps and Residual Diffs Statistics: OptimizedModule

| Cache Steps | Diffs P00 | Diffs P25 | Diffs P50 | Diffs P75 | Diffs P95 | Diffs Min | Diffs Max |
|-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 24          | 0.027     | 0.047     | 0.073     | 0.099     | 0.14      | 0.027     | 0.174     |

Time cost: 2.82s
Saving image to flux.C1_Q0_DBCache_F8B0_W8I1M0MC0_R0.08_T0O0_S24.png

Flux.1.dev with tensor parallelism (TP=2)

torchrun --nproc_per_node=2 run_flux_tp.py --parallel tp --height 720 --width 720 --steps 50 --prompt "A curious raccoon" --model-path /home/lmsys/bbuf/FLUX___1-dev --track-memory
W1120 11:59:37.958000 65189 torch/distributed/run.py:774] 
W1120 11:59:37.958000 65189 torch/distributed/run.py:774] *****************************************
W1120 11:59:37.958000 65189 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1120 11:59:37.958000 65189 torch/distributed/run.py:774] *****************************************
WARNING 11-20 11:59:44 [pattern_base.py:25] Context parallelism requires the 'diffusers>=0.36.dev0'.Please install latest version of diffusers from source: 
WARNING 11-20 11:59:44 [pattern_base.py:25] pip3 install git+https://github.com/huggingface/diffusers.git
Namespace(cache=False, compile=False, fuse_lora=False, steps=50, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=720, width=720, quantize=False, quantize_type='float8_weight_only', parallel_type='tp', attn=None, perf=False, prompt='A curious raccoon', negative_prompt=None, model_path='/home/lmsys/bbuf/FLUX___1-dev', track_memory=True)
WARNING 11-20 11:59:44 [pattern_base.py:25] Context parallelism requires the 'diffusers>=0.36.dev0'.Please install latest version of diffusers from source: 
WARNING 11-20 11:59:44 [pattern_base.py:25] pip3 install git+https://github.com/huggingface/diffusers.git
Namespace(cache=False, compile=False, fuse_lora=False, steps=50, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=720, width=720, quantize=False, quantize_type='float8_weight_only', parallel_type='tp', attn=None, perf=False, prompt='A curious raccoon', negative_prompt=None, model_path='/home/lmsys/bbuf/FLUX___1-dev', track_memory=True)
Loading pipeline components...:   0%|                                                                          | 0/7 [00:00<?, ?it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading pipeline components...:   0%|                                                                          | 0/7 [00:00<?, ?it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 134.21it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers                       | 0/2 [00:00<?, ?it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 133.67it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 62.49it/s]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 14.83it/s]
WARNING 11-20 11:59:45 [cache_interface.py:203] Parallelism is enabled and cache_config is None. Please manually set cache_config to avoid potential compatibility issues. Otherwise, cache will not be enabled.
WARNING 11-20 11:59:45 [cache_interface.py:289] cache_config is None, skip enabling cache for FluxPipeline.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 63.65it/s]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 14.84it/s]
WARNING 11-20 11:59:45 [cache_interface.py:203] Parallelism is enabled and cache_config is None. Please manually set cache_config to avoid potential compatibility issues. Otherwise, cache will not be enabled.
WARNING 11-20 11:59:45 [cache_interface.py:289] cache_config is None, skip enabling cache for FluxPipeline.
INFO 11-20 12:00:07 [parallel_interface.py:45] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2), transformer id:140280598265440
INFO 11-20 12:00:07 [parallel_interface.py:45] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2), transformer id:139812678044032
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:10<00:00,  4.76it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:10<00:00,  4.82it/s]
INFO 11-20 12:00:33 [utils.py:40] Peak GPU memory usage: 21.66 GB
INFO 11-20 12:00:33 [utils.py:40] Peak GPU memory usage: 21.66 GB
WARNING 11-20 12:00:33 [summary.py:275] Can't find Context Options for: FluxTransformer2DModel

🤖Parallelism Config: FluxTransformer2DModel

ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2)
Time cost: 10.47s
Saving image to flux.C0_Q0_NONE_TP2.png
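
Note that the peak-memory line is printed once per rank: under torchrun each process tracks only its own device, which is why the TP=2 run reports 21.66 GB on each GPU versus 32.68 GB on the single H100 above. A sketch of that per-rank behavior, assuming the standard LOCAL_RANK convention (the examples may select the device differently):

```python
# Sketch: why --track-memory reports once per process under torchrun
# (assumed wiring; not necessarily how the examples derive the device).
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

torch.cuda.reset_peak_memory_stats(local_rank)
# ... run the tensor-parallel pipeline on this rank ...
peak_gb = torch.cuda.max_memory_allocated(local_rank) / 1024**3
print(f"Peak GPU memory usage: {peak_gb:.2f} GB")  # emitted by every rank
```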

@DefTruth DefTruth left a comment

LGTM~

@DefTruth DefTruth changed the title from "[Feature] Cache-DIT example support more overrided args and memory tracker" to "example: support more overrided args and memory tracker" Nov 20, 2025
@DefTruth DefTruth merged commit abac1e5 into vipshop:main Nov 20, 2025
@gameofdimension

Bro, you're an absolute machine at coding!
