
Conversation

@BBuf BBuf commented Nov 20, 2025

This PR adds flexible command-line arguments and GPU memory tracking capabilities to all example scripts, making them more convenient for testing and benchmarking.

  1. Customizable Prompts

    • --prompt: Override the default prompt in examples
    • --negative-prompt: Override the default negative prompt in examples
  2. Flexible Model Path

    • --model-path: Override the model loading path, useful for testing different model versions or local checkpoints
  3. Memory Tracking

    • --track-memory: Track and report peak GPU memory usage during inference
    • Implemented via a new MemoryTracker context manager class in utils.py (see the sketch below)
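
These options are plain argparse flags; a rough sketch of how they might be wired up (hypothetical wiring, but the option names match the Namespace dumps in the runs below):

```python
# Sketch of the new flags (assumed wiring; the shared parser in the
# example scripts may structure this differently).
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--prompt", type=str, default=None,
                    help="Override the example's default prompt")
parser.add_argument("--negative-prompt", type=str, default=None,
                    help="Override the example's default negative prompt")
parser.add_argument("--model-path", type=str, default=None,
                    help="Load from a local checkpoint or a different model version")
parser.add_argument("--track-memory", action="store_true",
                    help="Track and report peak GPU memory usage during inference")
args = parser.parse_args()
```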

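The --track-memory flag is backed by the new MemoryTracker context manager in utils.py. A minimal sketch of such a class, assuming it wraps torch.cuda's peak-memory-stats APIs (the actual implementation may differ):

```python
# Minimal sketch of a MemoryTracker context manager (assumed shape;
# the real class in utils.py may differ).
import torch

class MemoryTracker:
    def __init__(self, device=None):
        self.device = device if device is not None else torch.cuda.current_device()
        self.peak_gb = 0.0

    def __enter__(self):
        # Reset stats so the peak reflects inference, not model loading.
        torch.cuda.reset_peak_memory_stats(self.device)
        return self

    def __exit__(self, exc_type, exc_value, traceback):
        torch.cuda.synchronize(self.device)
        self.peak_gb = torch.cuda.max_memory_allocated(self.device) / 1024**3
        print(f"Peak GPU memory usage: {self.peak_gb:.2f} GB")
        return False  # never swallow exceptions
```

Wrapped around the pipeline call (e.g. `with MemoryTracker(): image = pipe(prompt).images[0]`), this produces the "Peak GPU memory usage" lines in the runs below.
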
How to use?

Flux.1.dev on a single H100, with a custom model path, custom prompt, and memory tracking

python3 run_flux.py --height 720 --width 720 --steps 50 --cache --compile --prompt "A curious raccoon" --model-path /home/lmsys/bbuf/FLUX___1-dev --track-memory
WARNING 11-20 10:21:50 [pattern_base.py:25] Context parallelism requires the 'diffusers>=0.36.dev0'.Please install latest version of diffusers from source: 
WARNING 11-20 10:21:50 [pattern_base.py:25] pip3 install git+https://github.com/huggingface/diffusers.git
Namespace(cache=True, compile=True, fuse_lora=False, steps=50, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=720, width=720, quantize=False, quantize_type='float8_weight_only', parallel_type=None, attn=None, perf=False, prompt='A curious raccoon', negative_prompt=None, model_path='/home/lmsys/bbuf/FLUX___1-dev', track_memory=True)
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 60.12it/s]
Loading pipeline components...:  14%|█████████▍                                                        | 1/7 [00:00<00:01,  5.73it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 138.02it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers                       | 0/2 [00:00<?, ?it/s]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 14.90it/s]
INFO 11-20 10:21:50 [cache_adapter.py:49] FluxPipeline is officially supported by cache-dit. Use it's pre-defined BlockAdapter directly!
INFO 11-20 10:21:50 [functor_flux.py:61] Applied FluxPatchFunctor for FluxTransformer2DModel, Patch: False.
INFO 11-20 10:21:50 [block_adapters.py:147] Found transformer from diffusers: diffusers.models.transformers.transformer_flux enable check_forward_pattern by default.
INFO 11-20 10:21:50 [block_adapters.py:494] Match Block Forward Pattern: ['FluxTransformerBlock', 'FluxSingleTransformerBlock'], ForwardPattern.Pattern_1
INFO 11-20 10:21:50 [block_adapters.py:494] IN:('hidden_states', 'encoder_hidden_states'), OUT:('encoder_hidden_states', 'hidden_states'))
INFO 11-20 10:21:50 [cache_adapter.py:142] Use default 'enable_separate_cfg' from block adapter register: False, Pipeline: FluxPipeline.
INFO 11-20 10:21:50 [cache_adapter.py:307] Collected Context Config: DBCache_F8B0_W8I1M0MC0_R0.08, Calibrator Config: None
INFO 11-20 10:21:50 [pattern_base.py:70] Match Blocks: CachedBlocks_Pattern_0_1_2, for transformer_blocks, cache_context: transformer_blocks_140396272493024, context_manager: FluxPipeline_140396197738592.
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:05<00:00,  9.27it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:02<00:00, 18.63it/s]
INFO 11-20 10:22:20 [utils.py:40] Peak GPU memory usage: 32.68 GB
WARNING 11-20 10:22:20 [summary.py:275] Can't find Context Options for: FluxSingleTransformerBlock
WARNING 11-20 10:22:20 [summary.py:284] Can't find Parallelism Config for: FluxSingleTransformerBlock
WARNING 11-20 10:22:20 [summary.py:275] Can't find Context Options for: FluxTransformerBlock
WARNING 11-20 10:22:20 [summary.py:284] Can't find Parallelism Config for: FluxTransformerBlock

🤗Context Options: OptimizedModule

{'cache_config': DBCacheConfig(cache_type=<CacheType.DBCache: 'DBCache'>, Fn_compute_blocks=8, Bn_compute_blocks=0, residual_diff_threshold=0.08, max_accumulated_residual_diff_threshold=None, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, enable_separate_cfg=False, cfg_compute_first=False, cfg_diff_compute_separate=True, num_inference_steps=None, steps_computation_mask=None, steps_computation_policy='dynamic'), 'name': 'transformer_blocks_140396272493024'}
WARNING 11-20 10:22:20 [summary.py:284] Can't find Parallelism Config for: OptimizedModule

⚡️Cache Steps and Residual Diffs Statistics: OptimizedModule

| Cache Steps | Diffs P00 | Diffs P25 | Diffs P50 | Diffs P75 | Diffs P95 | Diffs Min | Diffs Max |
|-------------|-----------|-----------|-----------|-----------|-----------|-----------|-----------|
| 24          | 0.027     | 0.047     | 0.073     | 0.099     | 0.14      | 0.027     | 0.174     |

Time cost: 2.82s
Saving image to flux.C1_Q0_DBCache_F8B0_W8I1M0MC0_R0.08_T0O0_S24.png

Flux.1.dev with tensor parallelism (TP=2)

torchrun --nproc_per_node=2 run_flux_tp.py --parallel tp --height 720 --width 720 --steps 50 --prompt "A curious raccoon" --model-path /home/lmsys/bbuf/FLUX___1-dev --track-memory
W1120 11:59:37.958000 65189 torch/distributed/run.py:774] 
W1120 11:59:37.958000 65189 torch/distributed/run.py:774] *****************************************
W1120 11:59:37.958000 65189 torch/distributed/run.py:774] Setting OMP_NUM_THREADS environment variable for each process to be 1 in default, to avoid your system being overloaded, please further tune the variable for optimal performance in your application as needed. 
W1120 11:59:37.958000 65189 torch/distributed/run.py:774] *****************************************
WARNING 11-20 11:59:44 [pattern_base.py:25] Context parallelism requires the 'diffusers>=0.36.dev0'.Please install latest version of diffusers from source: 
WARNING 11-20 11:59:44 [pattern_base.py:25] pip3 install git+https://github.com/huggingface/diffusers.git
Namespace(cache=False, compile=False, fuse_lora=False, steps=50, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=720, width=720, quantize=False, quantize_type='float8_weight_only', parallel_type='tp', attn=None, perf=False, prompt='A curious raccoon', negative_prompt=None, model_path='/home/lmsys/bbuf/FLUX___1-dev', track_memory=True)
WARNING 11-20 11:59:44 [pattern_base.py:25] Context parallelism requires the 'diffusers>=0.36.dev0'.Please install latest version of diffusers from source: 
WARNING 11-20 11:59:44 [pattern_base.py:25] pip3 install git+https://github.com/huggingface/diffusers.git
Namespace(cache=False, compile=False, fuse_lora=False, steps=50, Fn=8, Bn=0, rdt=0.08, max_warmup_steps=8, warmup_interval=1, max_cached_steps=-1, max_continuous_cached_steps=-1, taylorseer=False, taylorseer_order=1, height=720, width=720, quantize=False, quantize_type='float8_weight_only', parallel_type='tp', attn=None, perf=False, prompt='A curious raccoon', negative_prompt=None, model_path='/home/lmsys/bbuf/FLUX___1-dev', track_memory=True)
Loading pipeline components...:   0%|                                                                          | 0/7 [00:00<?, ?it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading pipeline components...:   0%|                                                                          | 0/7 [00:00<?, ?it/s]`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 134.21it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers                       | 0/2 [00:00<?, ?it/s]
You set `add_prefix_space`. The tokenizer needs to be converted from the slow tokenizers
Loading checkpoint shards: 100%|██████████████████████████████████████████████████████████████████████| 2/2 [00:00<00:00, 133.67it/s]
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 62.49it/s]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 14.83it/s]
WARNING 11-20 11:59:45 [cache_interface.py:203] Parallelism is enabled and cache_config is None. Please manually set cache_config to avoid potential compatibility issues. Otherwise, cache will not be enabled.
WARNING 11-20 11:59:45 [cache_interface.py:289] cache_config is None, skip enabling cache for FluxPipeline.
Loading checkpoint shards: 100%|███████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 63.65it/s]
Loading pipeline components...: 100%|██████████████████████████████████████████████████████████████████| 7/7 [00:00<00:00, 14.84it/s]
WARNING 11-20 11:59:45 [cache_interface.py:203] Parallelism is enabled and cache_config is None. Please manually set cache_config to avoid potential compatibility issues. Otherwise, cache will not be enabled.
WARNING 11-20 11:59:45 [cache_interface.py:289] cache_config is None, skip enabling cache for FluxPipeline.
INFO 11-20 12:00:07 [parallel_interface.py:45] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2), transformer id:140280598265440
INFO 11-20 12:00:07 [parallel_interface.py:45] Enabled parallelism: ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2), transformer id:139812678044032
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:10<00:00,  4.76it/s]
100%|████████████████████████████████████████████████████████████████████████████████████████████████| 50/50 [00:10<00:00,  4.82it/s]
INFO 11-20 12:00:33 [utils.py:40] Peak GPU memory usage: 21.66 GB
INFO 11-20 12:00:33 [utils.py:40] Peak GPU memory usage: 21.66 GB
WARNING 11-20 12:00:33 [summary.py:275] Can't find Context Options for: FluxTransformer2DModel

🤖Parallelism Config: FluxTransformer2DModel

ParallelismConfig(backend=ParallelismBackend.NATIVE_PYTORCH, ulysses_size=None, ring_size=None, tp_size=2)
Time cost: 10.47s
Saving image to flux.C0_Q0_NONE_TP2.png
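
Note that the peak-memory line is printed once per rank: under torchrun each process tracks only its own device, which is why the TP=2 run reports 21.66 GB on each GPU versus 32.68 GB on the single H100 above. A sketch of that per-rank behavior, assuming the standard LOCAL_RANK convention (the examples may select the device differently):

```python
# Sketch: why --track-memory reports once per process under torchrun
# (assumed wiring; not necessarily how the examples derive the device).
import os
import torch

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
torch.cuda.set_device(local_rank)

torch.cuda.reset_peak_memory_stats(local_rank)
# ... run the tensor-parallel pipeline on this rank ...
peak_gb = torch.cuda.max_memory_allocated(local_rank) / 1024**3
print(f"Peak GPU memory usage: {peak_gb:.2f} GB")  # emitted by every rank
```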

@DefTruth DefTruth left a comment

LGTM~

@DefTruth DefTruth changed the title from "[Feature] Cache-DIT example support more overrided args and memory tracker" to "example: support more overrided args and memory tracker" Nov 20, 2025
@DefTruth DefTruth merged commit abac1e5 into vipshop:main Nov 20, 2025
@gameofdimension

Bro, you're an absolute machine at coding!
