
[Bug] SGLang FastWan Model Loading - Parameter block not found in model state dict #902

@faradawn

Describe the bug

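SGLang can serve other models such as Wan, but the FastWan model (FastVideo/FastWan2.1-T2V-1.3B-Diffusers) fails while the transformer component is loading. The error is: "ValueError: Parameter blocks.0.to_gate_compress.bias not found in custom model state dict. The hf to custom mapping may be incorrect." The command and full log follow.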

sglang serve \
  --model-path FastVideo/FastWan2.1-T2V-1.3B-Diffusers
[11-25 23:45:59] server_args: {"model_path": "FastVideo/FastWan2.1-T2V-1.3B-Diffusers", "attention_backend": null, "mode": "inference", "workload_type": "t2v", "cache_strategy": "none", "distributed_executor_backend": "mp", "nccl_port": null, "trust_remote_code": false, "revision": null, "num_gpus": 1, "tp_size": 1, "sp_degree": 1, "ulysses_degree": 1, "ring_degree": 1, "dp_size": 1, "dp_degree": 1, "enable_cfg_parallel": false, "hsdp_replicate_dim": 1, "hsdp_shard_dim": 1, "dist_timeout": null, "lora_path": null, "lora_nickname": "default", "lora_target_modules": null, "output_type": "pil", "dit_cpu_offload": true, "use_fsdp_inference": false, "text_encoder_cpu_offload": true, "image_encoder_cpu_offload": true, "vae_cpu_offload": true, "pin_cpu_memory": true, "mask_strategy_file_path": null, "STA_mode": "STA_inference", "skip_time_steps": 15, "enable_torch_compile": false, "disable_autocast": false, "VSA_sparsity": 0.0, "moba_config_path": null, "moba_config": {}, "master_port": 30007, "host": null, "port": null, "scheduler_port": 5653, "enable_stage_verification": true, "prompt_file_path": null, "model_paths": {}, "model_loaded": {"transformer": true, "vae": true}, "override_transformer_cls_name": null, "boundary_ratio": null, "log_level": "info"}
[11-25 23:45:59] Starting server...
[11-25 23:46:05] Scheduler bind at endpoint: tcp://localhost:5653
[11-25 23:46:05] Initializing distributed environment with world_size=1, device=cuda:0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
Set TORCH_CUDA_ARCH_LIST to 9.0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[11-25 23:46:06] Downloaded model_index.json for FastVideo/FastWan2.1-T2V-1.3B-Diffusers, pipeline: WanDMDPipeline
[11-25 23:46:06] Loading pipeline modules...
[11-25 23:46:06] Downloading model snapshot from HF Hub for FastVideo/FastWan2.1-T2V-1.3B-Diffusers...
[11-25 23:46:06] Downloaded model to /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476
[11-25 23:46:06] Model path: /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476
[11-25 23:46:06] Diffusers version: 0.33.0.dev0
[11-25 23:46:06] Loading pipeline modules from config: {'_class_name': 'WanDMDPipeline', '_diffusers_version': '0.33.0.dev0', 'scheduler': ['diffusers', 'UniPCMultistepScheduler'], 'text_encoder': ['transformers', 'UMT5EncoderModel'], 'tokenizer': ['transformers', 'T5TokenizerFast'], 'transformer': ['diffusers', 'WanTransformer3DModel'], 'vae': ['diffusers', 'AutoencoderKLWan']}
[11-25 23:46:06] Loading required components: ['text_encoder', 'tokenizer', 'vae', 'transformer', 'scheduler']
Loading required modules:   0%|                                                           | 0/5 [00:00<?, ?it/s][11-25 23:46:06] Loading text_encoder using transformers from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/text_encoder
[11-25 23:46:06] HF model config: {'architectures': ['UMT5EncoderModel'], 'classifier_dropout': 0.0, 'd_ff': 10240, 'd_kv': 64, 'd_model': 4096, 'decoder_start_token_id': 0, 'dense_act_fn': 'gelu_new', 'dropout_rate': 0.1, 'eos_token_id': 1, 'feed_forward_proj': 'gated-gelu', 'initializer_factor': 1.0, 'is_encoder_decoder': True, 'is_gated_act': True, 'layer_norm_epsilon': 1e-06, 'num_decoder_layers': 24, 'num_heads': 64, 'num_layers': 24, 'output_past': True, 'pad_token_id': 0, 'relative_attention_max_distance': 128, 'relative_attention_num_buckets': 32, 'scalable_attention': True, 'tie_word_embeddings': False, 'use_cache': True, 'vocab_size': 256384}

Loading safetensors checkpoint shards:   0% Completed | 0/5 [00:00<?, ?it/s]
Loading safetensors checkpoint shards:  20% Completed | 1/5 [00:00<00:00,  4.89it/s]
Loading safetensors checkpoint shards:  40% Completed | 2/5 [00:00<00:00,  3.19it/s]
Loading safetensors checkpoint shards:  60% Completed | 3/5 [00:01<00:00,  2.71it/s]
Loading safetensors checkpoint shards:  80% Completed | 4/5 [00:01<00:00,  2.60it/s]
Loading safetensors checkpoint shards: 100% Completed | 5/5 [00:01<00:00,  2.64it/s]

[11-25 23:46:08] Loading weights took 1.98 seconds
[11-25 23:46:34] Loaded module text_encoder from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/text_encoder
Loading required modules:  20%|██████████▏                                        | 1/5 [00:27<01:51, 27.86s/it][11-25 23:46:34] Loading tokenizer using transformers from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/tokenizer
[11-25 23:46:34] Loading tokenizer from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/tokenizer
[11-25 23:46:34] Loaded tokenizer: T5TokenizerFast
[11-25 23:46:34] Loaded module tokenizer from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/tokenizer
Loading required modules:  40%|████████████████████▍                              | 2/5 [00:28<00:35, 11.72s/it][11-25 23:46:34] Loading vae using diffusers from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/vae
[11-25 23:46:34] HF model config: {'attn_scales': [], 'base_dim': 96, 'dim_mult': [1, 2, 4, 4], 'dropout': 0.0, 'latents_mean': [-0.7571, -0.7089, -0.9113, 0.1075, -0.1745, 0.9653, -0.1517, 1.5508, 0.4134, -0.0715, 0.5517, -0.3632, -0.1922, -0.9497, 0.2503, -0.2921], 'latents_std': [2.8184, 1.4541, 2.3275, 2.6558, 1.2196, 1.7708, 2.6052, 2.0743, 3.2687, 2.1526, 2.8652, 1.5579, 1.6382, 1.1253, 2.8251, 1.916], 'num_res_blocks': 2, 'temperal_downsample': [False, True, True], 'z_dim': 16}
[11-25 23:46:34] Loaded module vae from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/vae
[11-25 23:46:34] Loading transformer using diffusers from /root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/transformer
[11-25 23:46:34] transformer cls_name: WanTransformer3DModel
[11-25 23:46:34] Loading model from 1 safetensors files: ['/root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/transformer/diffusion_pytorch_model.safetensors']
[11-25 23:46:34] Loading WanTransformer3DModel, default_dtype: torch.bfloat16
[11-25 23:46:34] Using FlashAttention (FA3 for hopper, FA4 for blackwell) backend.

Loading safetensors checkpoint shards: 100% Completed | 1/1 [00:00<00:00, 52.99it/s]

[11-25 23:46:34] Error while loading component: transformer, component_model_path='/root/.cache/huggingface/hub/models--FastVideo--FastWan2.1-T2V-1.3B-Diffusers/snapshots/75640eb8d44c1d5f9dd4c7824ecfb39bf8e4d476/transformer'
Loading required modules:  60%|██████████████████████████████▌                    | 3/5 [00:28<00:18,  9.46s/it]
Process sglang-diffusionWorker-0:
Traceback (most recent call last):
  File "/usr/lib/python3.12/multiprocessing/process.py", line 314, in _bootstrap
    self.run()
  File "/usr/lib/python3.12/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 179, in run_scheduler_process
    scheduler = Scheduler(
                ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/managers/scheduler.py", line 53, in __init__
    worker = GPUWorker(
             ^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 59, in __init__
    self.init_device_and_model()
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/managers/gpu_worker.py", line 88, in init_device_and_model
    self.pipeline = build_pipeline(self.server_args)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/__init__.py", line 52, in build_pipeline
    pipeline = pipeline_cls(model_path, server_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/lora_pipeline.py", line 53, in __init__
    super().__init__(*args, **kwargs)
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/composed_pipeline_base.py", line 89, in __init__
    self.modules = self.load_modules(server_args, loaded_modules)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/pipelines_core/composed_pipeline_base.py", line 302, in load_modules
    module = PipelineComponentLoader.load_module(
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/loader/component_loader.py", line 684, in load_module
    raise e
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/loader/component_loader.py", line 679, in load_module
    return loader.load(component_model_path, server_args, module_name)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/loader/component_loader.py", line 555, in load
    model = maybe_load_fsdp_model(
            ^^^^^^^^^^^^^^^^^^^^^^
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/loader/fsdp_load.py", line 136, in maybe_load_fsdp_model
    load_model_from_full_model_state_dict(
  File "/sgl-workspace/sglang/python/sglang/multimodal_gen/runtime/loader/fsdp_load.py", line 257, in load_model_from_full_model_state_dict
    raise ValueError(
ValueError: Parameter blocks.0.to_gate_compress.bias not found in custom model state dict. The hf to custom mapping may be incorrect.
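
To narrow this down, it may help to check which tensor names the transformer checkpoint actually contains. The sketch below is not part of SGLang. It reads only the JSON header of a .safetensors file (an 8-byte little-endian length followed by the JSON) and lists the names that contain a given substring. The function names and the script name inspect_ckpt.py are hypothetical, chosen for illustration.

```python
import json
import struct
import sys


def safetensors_tensor_names(path):
    """Return the tensor names stored in a .safetensors file.

    Only the JSON header is read, so no tensor data is loaded.
    """
    with open(path, "rb") as f:
        # The file starts with an 8-byte little-endian header length,
        # followed by that many bytes of UTF-8 JSON.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    # "__metadata__" is an optional free-form entry, not a tensor.
    return sorted(name for name in header if name != "__metadata__")


def names_containing(names, needle):
    """Filter tensor names by substring, e.g. 'to_gate_compress'."""
    return [name for name in names if needle in name]


if __name__ == "__main__" and len(sys.argv) > 1:
    # Hypothetical usage:
    #   python inspect_ckpt.py <path to diffusion_pytorch_model.safetensors> [substring]
    ckpt_path = sys.argv[1]
    needle = sys.argv[2] if len(sys.argv) > 2 else "to_gate_compress"
    for name in names_containing(safetensors_tensor_names(ckpt_path), needle):
        print(name)
```

Pointing this at the transformer/diffusion_pytorch_model.safetensors file named in the log should show whether to_gate_compress tensors are present in the checkpoint. If they are, the error message suggests that the model SGLang builds for WanTransformer3DModel has no parameters with matching names. This is an inference from the error text and has not been verified against the loader code.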

Reproduction

# pull and start the sglang dev container
docker run -itd --gpus all \
    --shm-size 32g \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=xxx" \
    --ipc=host \
    --network=host \
    --privileged \
    --name sglang_dev \
    lmsysorg/sglang:dev bash

# go into the container 
docker exec -it sglang_dev bash

# run the fast wan model
sglang serve --model-path FastVideo/FastWan2.1-T2V-1.3B-Diffusers

# the ValueError shown above is raised while the transformer component loads

Environment

Single H200 GPU.
