docker run --gpus 'all' fails: is multi-GPU not supported? #53

On a server with two A40 GPUs, I deploy the service on GPU with the following command:
docker run --gpus 'all' --shm-size 1g -p 9090:80 -v $HOME/models:/data \
        --env LOG_LEVEL="info,text_generation_router=debug" \
        ghcr.nju.edu.cn/huggingface/text-generation-inference:1.0.3 \
        --model-id /data/CodeShell-7B-Chat --num-shard 1 \
        --max-total-tokens 5000 --max-input-length 4096 \
        --max-stop-sequences 12 --trust-remote-code

It fails with the following error:
2024-01-19T08:15:44.995533Z ERROR warmup{max_input_length=4096 max_prefill_tokens=4096}:warmup: text_generation_client: router/client/src/lib.rs:33: Server error: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
Error: Warmup(Generation("Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!"))
2024-01-19T08:15:45.052858Z ERROR text_generation_launcher: Webserver Crashed
2024-01-19T08:15:45.052873Z  INFO text_generation_launcher: Shutting down shards
2024-01-19T08:15:45.395141Z  INFO shard-manager: text_generation_launcher: Shard terminated rank=0
Error: WebserverFailed
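
A possible way to narrow this down, sketched under an unconfirmed assumption: the command starts a single shard (`--num-shard 1`) while `--gpus 'all'` makes both cards visible, so the model's tensors may end up split across `cuda:0` and `cuda:1`. The script below rebuilds the same command but exposes only one GPU to the container. The variable names `GPU_SPEC` and `RUN` and the choice of GPU index 0 are illustrative, not taken from the report.

```shell
#!/usr/bin/env bash
# Sketch only: assumes the device mismatch comes from one shard seeing both GPUs.
# Everything except the --gpus value is copied from the original command.
GPU_SPEC='device=0'   # assumption: expose only GPU 0 to the container

cmd=(docker run --gpus "$GPU_SPEC" --shm-size 1g -p 9090:80 -v "$HOME/models:/data"
     --env LOG_LEVEL="info,text_generation_router=debug"
     ghcr.nju.edu.cn/huggingface/text-generation-inference:1.0.3
     --model-id /data/CodeShell-7B-Chat --num-shard 1
     --max-total-tokens 5000 --max-input-length 4096
     --max-stop-sequences 12 --trust-remote-code)

# Print the command for review; set RUN=1 to actually start the container.
printf '%s ' "${cmd[@]}"; echo
if [ "${RUN:-0}" = "1" ]; then
    "${cmd[@]}"
fi
```

If warmup succeeds with a single visible GPU, that would point to device placement rather than the model files. Another variant to try is keeping `--gpus 'all'` and setting `--num-shard 2`; whether this model supports sharding in this image is not confirmed here.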
