[model] support FireRedASR2 #35727
Conversation
Documentation preview: https://vllm--35727.org.readthedocs.build/en/35727/
Code Review
This pull request adds support for the FireRedASR2 model. The implementation looks mostly good, but there are a few critical issues that need to be addressed.
- There is a hardcoded `.cuda()` call which will break device-agnostic execution.
- The prompt for transcription is hardcoded to Chinese, which prevents using the model for other languages and ignores user-provided prompts.
- There is a bug in the processor that breaks batch processing for multiple audio files.
I've left specific comments with suggestions for fixes.
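As an illustration of the hardcoded-prompt issue, a minimal sketch of the kind of fix suggested; the function name, signature, and language table below are hypothetical, not vLLM's actual API. The idea is to prefer the caller's prompt and fall back to a per-language default instead of a fixed Chinese string:

```python
# Hypothetical helper illustrating the suggested fix; not vLLM code.
DEFAULT_PROMPTS = {
    "zh": "请转写音频。",
    "en": "Transcribe the audio.",
}

def build_transcription_prompt(user_prompt, language="zh"):
    # Honor a user-provided prompt instead of ignoring it.
    if user_prompt:
        return user_prompt
    # Fall back to a per-language default rather than a hardcoded
    # Chinese string; unknown languages default to English.
    return DEFAULT_PROMPTS.get(language, DEFAULT_PROMPTS["en"])
```

A device-agnostic fix for the `.cuda()` call follows the same spirit: derive the target device from the module's own parameters rather than assuming CUDA.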
@Isotr0py Many CI checks have failed. Should I rebase onto main?
I think you need to install
Head branch was pushed to by a user without write access
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
The `amd-v1-others-mi325-1` CI check is still failing. How should I handle it? @Isotr0py @DarkLight1337
PR merged, thank you!
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com> Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn> Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com> Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
When I use 16 concurrent requests to call vLLM serving FireRedASR2S, I occasionally encounter the following issue: the vLLM health check passes, but requests hang and never receive a response.

(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] Unexpected error pre-processing request transcribe-b7ab8af624eb4608_1-88994c71

My setup uses four A100 80GB GPUs. This hanging issue occurs more frequently when the concurrency is 12 or above 20, and becomes relatively less frequent at 16 concurrent requests, though it still happens occasionally even at that level.
Could you show me your vllm serve command, test script, and some audio files? Also your vLLM version or commit SHA.
Thank you for your response. Below is the information you requested:
- vLLM Serve Command
- vLLM Version
- Test Script & Audio Files
If you need further information or have additional questions, feel free to ask.
@Virtuoso461 What are the min/max/avg durations of your audio files? Is there anything special about them? I installed vllm via
Both extremely long and extremely short utterances, in either Chinese or English, fail to be recognized correctly.
Just to add: among the hundreds of thousands of samples, there are a few hundred such problematic ones. When vLLM encounters them, it doesn't report any error but simply hangs. The only way I can avoid this is by setting a timeout. However, even after the timeout, if I subsequently feed normal-length audio, vLLM still does not respond; only the KV cache keeps increasing, and no results are produced. The service only resumes after a restart.
Could you share at least one problematic audio file? It's hard to reproduce your problem otherwise.
It's not convenient to share; it's company data and external distribution is forbidden. What I can tell you is that once an utterance gets long, e.g. the file size exceeds 50 MB, it triggers vLLM's multimodal exception (vLLM limits it to 50 MB). I haven't yet figured out where the exception originates; it may be the request-header issue you mentioned.
The data is roughly a bunch of filler utterances like "uh-huh" and "mm". I don't know the cause of the crash yet, only that the crash probability is high. If your dataset is diverse enough, problems tend to occur on overly long inputs and on overly repetitive, very short outputs, and concurrent scenarios also have serious problems.
@Virtuoso461 A 50 MB audio file is too large. I think you should process the audio using a VAD model, which splits the audio file into multiple chunks, and then feed these smaller files into the model.
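As a toy illustration of the suggested splitting, here is a naive energy-threshold splitter standing in for a real VAD model; the frame size and thresholds are made-up values, and a proper VAD (e.g. a neural or WebRTC one) would be far more robust:

```python
def split_on_silence(samples, frame_len=160, energy_threshold=0.01,
                     min_silence_frames=10):
    """Split a mono waveform (floats in [-1, 1]) into voiced chunks
    at sufficiently long silent gaps (naive stand-in for a VAD)."""
    chunks, current, silent_run = [], [], 0
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        # Mean energy of the frame decides voiced vs. silent.
        energy = sum(s * s for s in frame) / max(len(frame), 1)
        silent_run = silent_run + 1 if energy < energy_threshold else 0
        current.extend(frame)
        if silent_run >= min_silence_frames:
            # Emit the voiced part, dropping the trailing silence.
            voiced = current[:-(min_silence_frames * frame_len)]
            if voiced:
                chunks.append(voiced)
            current, silent_run = [], 0
    if current:
        chunks.append(current)
    return chunks

# Synthetic example: speech, a silent gap, then speech again.
samples = [0.5] * 2000 + [0.0] * 2000 + [0.5] * 2000
chunks = split_on_silence(samples)
print(len(chunks))  # 2
```

Each resulting chunk can then be written out (e.g. as WAV) and sent as a separate transcription request, keeping every request well under the size limit.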
@Virtuoso461 If you have any data privacy concerns, feel free to contact me directly via WeChat/phone at 18621277886.
Hi @DarkLight1337 , this PR adds support for the FireRedASR2 model (https://github.com/FireRedTeam/FireRedASR2S) Could you please take a look?
server:
vllm serve allendou/FireRedASR2-LLM-vllm -tp=1 --dtype=float32 (also supports -tp=2 --dtype=bfloat16)
client:
python3 examples/online_serving/openai_transcription_client.py --repetition_penalty=1.0 --audio_path=/root/hello_zh.wav
result:
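For reference, the pieces such a client call carries can be sketched without the transport layer. The helper below and its defaults are illustrative, not the example script's actual code; vLLM does serve the OpenAI-compatible `/v1/audio/transcriptions` endpoint, and extra sampling parameters like `repetition_penalty` are typically passed via `extra_body` when using the openai Python client:

```python
def build_transcription_request(audio_path,
                                model="allendou/FireRedASR2-LLM-vllm",
                                repetition_penalty=1.0):
    # Assemble the fields of an OpenAI-compatible transcription call;
    # actually sending them (e.g. via the openai client) is omitted here.
    return {
        "url": "http://localhost:8000/v1/audio/transcriptions",
        "form": {"model": model},
        # Extra vLLM sampling params go through extra_body in the client.
        "extra_body": {"repetition_penalty": repetition_penalty},
        "file": audio_path,
    }

req = build_transcription_request("/root/hello_zh.wav")
print(req["url"])  # http://localhost:8000/v1/audio/transcriptions
```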
By the way, FireRedASR2 uses a rather unusual encoder multi-head attention class, `RelPosMultiHeadAttention(EncoderMultiHeadAttention)`. I still haven't found a better way to replace it with MMEncoderAttention.
Also, users can purchase the FireRedASR2 service from Alibaba PAI: https://pai.console.aliyun.com/?regionId=cn-hangzhou#/quick-start/models/FireRedASR2-LLM-vllm/intro