Describe the bug
SGLang can serve other models such as Wan, but loading the FastWan model fails with this error: Parameter blocks.0.to_gate_compress.bias not found in custom model state dict.
sglang serve \
  --model-path FastVideo/FastWan2.1-T2V-1.3B-Diffusers
[11-25 23:45:59] server_args: {"model_path": "FastVideo/FastWan2.1-T2V-1.3B-Diffusers", "attention_backend": null, "mode": "inference", "workload_type": "t2v", "cache_strategy": "none", "distributed_executor_backend": "mp", "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": null, "lora_path": null, "lora_nickname": "default", "lora_target_modules": null, "output_type": "pil", "dit_cpu_offload": true, "use_fsdp_inference": false, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "pin_cpu_memory": true, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "disable_autocast": false, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30007, "host": null, "port": null, "scheduler_port": 5653, "enable_stage_verification": true, "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "override_transformer_cls_name": null, "boundary_ratio": null, "log_level": "info"}
[11-25 23:45:59] Starting server...
[11-25 23:46:05] Scheduler bind at endpoint: tcp://localhost:5653
[11-25 23:46:05] Initializing distributed environment with world_size=1, device=cuda:0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Set TORCH_CUDA_ARCH_LIST to 9.0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[11-25 23:46:06] Downloaded model_index.json for FastVideo/FastWan2.1-T2V-1.3B-Diffusers, pipeline: WanDMDPipeline
[11-25 23:46:06] Loading pipeline modules...
[11-25 23:46:06] Downloading model snapshot from HF Hub for FastVideo/FastWan2.1-T2V-1.3B-Diffusers...
[11-25 23:46:06] Downloaded model to /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476
[11-25 23:46:06] Model path: /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476
[11-25 23:46:06] Diffusers version: 0.33.0.dev0
[11-25 23:46:06] Loading pipeline modules from config: {'_class_name': 'WanDMDPipeline', '_diffusers_version': '0.33.0.dev0', 'scheduler': ['diffusers', 'UniPCMultistepScheduler'], 'text_encoder': ['transformers', 'UMT5EncoderModel'], 'tokenizer': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'WanTransformer3DModel'], 'vae': ['diffusers', 'AutoencoderKLWan']}
[11-25 23:46:06] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler']
Loading required modules: 0%| | 0/5 [00:00<?, ?it/s][11-25 23:46:06] Loading text_encoder using transformers from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/text_encoder
[11-25 23:46:06] HF model config: {'architectures': ['UMT5EncoderModel'], 'classifier_dropout': 0.0, 'd_ff': 10240, 'd_kv': 64, 'd_model': 4096, 'decoder_start_token_id': 0, 'dense_act_fn': 'gelu_new', 'dropout_rate': 0.1, 'eos_token_id': 1, 'feed_forward_proj': 'gated-gelu', 'initializer_factor': 1.0, 'is_encoder_decoder': True, 'is_gated_act': True, 'layer_norm_epsilon': 1e-06, 'num_decoder_layers': 24, 'num_heads': 64, 'num_layers': 24, 'output_past': True, 'pad_token_id': 0, 'relative_attention_max_distance': 128, 'relative_attention_num_buckets': 32, 'scalable_attention': True, 'tie_word_embeddings': False, 'use_cache': True, 'vocab_size': 256384}
Loading safetensors checkpoint shards: 0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards: 20% Completed | 1/5 [00:00<00:00, 4.89it/s]
Loading safetensors checkpoint shards: 40% Completed | 2/5 [00:00<00:00, 3.19it/s]
Loading safetensors checkpoint shards: 60% Completed | 3/5 [00:01<00:00, 2.71it/s]
Loading safetensors checkpoint shards: 80% Completed | 4/5 [00:01<00:00, 2.60it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00, 2.64it/s]
[11-25 23:46:08] Loading weights took 1.98 seconds
[11-25 23:46:34] Loaded module text_encoder from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/text_encoder
Loading required modules: 20%|██████████▏ | 1/5 [00:27<01:51, 27.86s/it][11-25 23:46:34] Loading tokenizer using transformers from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/tokenizer
[11-25 23:46:34] Loading tokenizer from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/tokenizer
[11-25 23:46:34] Loaded tokenizer: T5TokenizerFast
[11-25 23:46:34] Loaded module tokenizer from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/tokenizer
Loading required modules: 40%|████████████████████▍ | 2/5 [00:28<00:35, 11.72s/it][11-25 23:46:34] Loading vae using diffusers from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/vae
[11-25 23:46:34] HF model config: {'attn_scales': [], 'base_dim': 96, 'dim_mult': [1, 2, 4, 4], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'num_res_blocks': 2, 'temperal_downsample': [False, True, True], 'z_dim': 16}
[11-25 23:46:34] Loaded module vae from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/vae
[11-25 23:46:34] Loading transformer using diffusers from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/transformer
[11-25 23:46:34] transformer cls_name: WanTransformer3DModel
[11-25 23:46:34] Loading model from 1 safetensors files: ['/root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/transformer/diffusion_pytorch_model.safetensors']
[11-25 23:46:34] Loading WanTransformer3DModel, default_dtype: torch.bfloat16
[11-25 23:46:34] Using FlashAttention (FA3 for hopper, FA4 for blackwell) backend.
Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 52.99it/s]
[11-25 23:46:34] Error while loading component: transformer, component_model_path='/root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/transformer'
Loading required modules: 60%|██████████████████████████████▌ | 3/5 [00:28<00:18, 9.46s/it]
Process sglang-diffusionWorker-0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 179, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/managers/scheduler.py", line 53, in __init__
    worker = GPUWorker(
             ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 59, in __init__
    self.init_device_and_model()
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 88, in init_device_and_model
    self.pipeline = build_pipeline(self.server_args)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/__init__.py", line 52, in build_pipeline
    pipeline = pipeline_cls(model_path, server_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/lora_pipeline.py", line 53, in __init__
    super().__init__(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/composed_pipeline_base.py", line 89, in __init__
    self.modules = self.load_modules(server_args, loaded_modules)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/composed_pipeline_base.py", line 302, in load_modules
    module = PipelineComponentLoader.load_module(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/loader/component_loader.py", line 684, in load_module
    raise e
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/loader/component_loader.py", line 679, in load_module
    return loader.load(component_model_path, server_args, module_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/loader/component_loader.py", line 555, in load
    model = maybe_load_fsdp_model(
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/loader/fsdp_load.py", line 136, in maybe_load_fsdp_model
    load_model_from_full_model_state_dict(
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/loader/fsdp_load.py", line 257, in load_model_from_full_model_state_dict
    raise ValueError(
ValueError: Parameter blocks.0.to_gate_compress.bias not found in custom model state dict. The hf to custom mapping may be incorrect.
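One way to narrow this down is to list which checkpoint tensors have no same-named parameter in the instantiated model. The snippet below is only a rough sketch and not SGLang's loader code: the helper names (checkpoint_keys, unmatched_keys, summarize) and the toy key lists are made up for illustration, and it compares names directly, skipping the HF-to-custom name mapping that the real loader applies. It assumes the safetensors package is installed if checkpoint_keys is used.

```python
import re
from collections import Counter


def checkpoint_keys(path):
    """Return the tensor names stored in a .safetensors file."""
    # Imported lazily so the pure-Python helpers below work without the package.
    from safetensors import safe_open

    with safe_open(path, framework="pt") as f:
        return list(f.keys())


def unmatched_keys(ckpt_keys, model_keys):
    """Checkpoint keys that have no same-named parameter in the model."""
    model = set(model_keys)
    return sorted(k for k in ckpt_keys if k not in model)


def summarize(keys):
    """Collapse block indices (blocks.0., blocks.1., ...) and count each pattern."""
    return Counter(re.sub(r"\.\d+\.", ".*.", k) for k in keys)


# Toy data for illustration only. Real names would come from
# checkpoint_keys(<path to diffusion_pytorch_model.safetensors>) and from
# model.state_dict().keys() of the instantiated transformer.
ckpt = [
    "blocks.0.to_q.weight",
    "blocks.0.to_gate_compress.weight",
    "blocks.0.to_gate_compress.bias",
    "blocks.1.to_gate_compress.bias",
]
model = ["blocks.0.to_q.weight", "blocks.1.to_q.weight"]

missing = unmatched_keys(ckpt, model)
print(summarize(missing))
```

If every unmatched pattern turns out to be a to_gate_compress weight or bias, that would point to the checkpoint carrying parameters that the selected WanTransformer3DModel class does not define, rather than to a corrupted download.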
Reproduction
# pull the sglang dev container
docker run -itd --gpus all \
--shm-size 32g \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--env "HF_TOKEN=xxx" \
--ipc=host \
--network=host \
--privileged \
--name sglang_dev \
lmsysorg/sglang:dev bash
# go into the container
docker exec -it sglang_dev bash
# run the fast wan model
sglang serve --model-path FastVideo/FastWan2.1-T2V-1.3B-Diffusers
# error will appear
Environment
Single H200 GPU.