
[model] support FireRedASR2#35727

Merged
vllm-bot merged 18 commits into vllm-project:main from AllenDou:fireredasr2
Mar 4, 2026

Conversation

@AllenDou
Contributor

@AllenDou AllenDou commented Mar 2, 2026

Hi @DarkLight1337, this PR adds support for the FireRedASR2 model (https://github.com/FireRedTeam/FireRedASR2S). Could you please take a look?

server:
vllm serve allendou/FireRedASR2-LLM-vllm -tp=1 --dtype=float32 (also supports -tp=2 --dtype=bfloat16)

client:
python3 examples/online_serving/openai_transcription_client.py --repetition_penalty=1.0 --audio_path=/root/hello_zh.wav

result:

transcription result [sync]: 你好世界

transcription result [stream]: 你好世界
[Stream finished reason: stop]

By the way, FireRedASR2 uses a rather unusual encoder multi-head attention class, RelPosMultiHeadAttention(EncoderMultiHeadAttention). I still haven't found a good way to replace it with MMEncoderAttention.

Also, users can purchase the FireRedASR2 service from Alibaba PAI: https://pai.console.aliyun.com/?regionId=cn-hangzhou#/quick-start/models/FireRedASR2-LLM-vllm/intro

@mergify
Contributor

mergify Bot commented Mar 2, 2026

Documentation preview: https://vllm--35727.org.readthedocs.build/en/35727/

@mergify mergify Bot added documentation Improvements or additions to documentation new-model Requests to new models labels Mar 2, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for the FireRedASR2 model. The implementation looks mostly good, but there are a few critical issues that need to be addressed.

  • There is a hardcoded .cuda() call which will break device-agnostic execution.
  • The prompt for transcription is hardcoded to Chinese, which prevents using the model for other languages and ignores user-provided prompts.
  • There is a bug in the processor that breaks batch processing for multiple audio files.

I've left specific comments with suggestions for fixes.

Comment thread vllm/model_executor/models/fireredasr2.py Outdated
Comment thread vllm/transformers_utils/processors/fireredasr2_processor.py
Comment thread vllm/model_executor/models/fireredasr2.py
@DarkLight1337 DarkLight1337 requested a review from Isotr0py March 2, 2026 09:46
Comment thread vllm/model_executor/models/fireredasr2.py
Comment thread tests/models/registry.py Outdated
Comment thread docs/models/supported_models.md Outdated
Comment thread vllm/model_executor/models/fireredasr2.py Outdated
Comment thread vllm/model_executor/models/fireredasr2.py Outdated
Comment thread vllm/model_executor/models/fireredasr2.py Outdated
Comment thread vllm/model_executor/models/fireredasr2.py Outdated
@AllenDou AllenDou requested a review from Isotr0py March 3, 2026 06:21
@Isotr0py Isotr0py enabled auto-merge (squash) March 3, 2026 09:45
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 3, 2026
@AllenDou
Contributor Author

AllenDou commented Mar 3, 2026

@Isotr0py Many CI checks have failed. Should I rebase onto main?
Also, I hit a weird error: "Unable to rebase: Mergify can't impersonate AllenDou".

@DarkLight1337
Member

I think you need to install kaldi_native_fbank for the test

auto-merge was automatically disabled March 3, 2026 11:16

Head branch was pushed to by a user without write access

@mergify mergify Bot added the ci/build label Mar 3, 2026
zixiao and others added 14 commits March 3, 2026 19:18
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
zixiao added 2 commits March 3, 2026 19:46
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
@AllenDou
Contributor Author

AllenDou commented Mar 4, 2026

The CI check amd-v1-others-mi325-1 is still failing. How should I handle it? @Isotr0py @DarkLight1337

@vllm-bot vllm-bot merged commit c1d9634 into vllm-project:main Mar 4, 2026
114 of 116 checks passed
@AllenDou
Contributor Author

AllenDou commented Mar 4, 2026

PR merged, thank you!

@DarkLight1337 DarkLight1337 mentioned this pull request Mar 9, 2026
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Mar 12, 2026
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Virtuoso461

(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] Unexpected error pre-processing request transcribe-b7ab8af624eb4608_1-88994c71
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] Traceback (most recent call last):
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] File "/root/miniforge3/envs/qiu/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1333, in process_input_sockets
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] request = self.preprocess_add_request(req)
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] File "/root/miniforge3/envs/qiu/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 745, in preprocess_add_request
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] request.mm_features = self.mm_receiver_cache.get_and_update_features(
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] File "/root/miniforge3/envs/qiu/lib/python3.12/site-packages/vllm/multimodal/cache.py", line 591, in get_and_update_features
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] feature.data = self.get_and_update_item(feature.data, cache_key)
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] File "/root/miniforge3/envs/qiu/lib/python3.12/site-packages/vllm/multimodal/cache.py", line 644, in get_and_update_item
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] assert mm_item is not None, f"Expected a cached item for {mm_hash=}"
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] AssertionError: Expected a cached item for mm_hash='c2a2c5bacb156913fd1771ca3976dba30430241c4f71b4e998676b32f85eccf4'

When I use 16 concurrent requests to call vLLM serving FireRedASR2S, I occasionally encounter the following issue: the vLLM health check passes, but requests hang and never receive a response. My setup uses four A100 80GB GPUs. This hanging issue occurs more frequently when the concurrency is set to 12 or above 20, and it becomes relatively less frequent at 16 concurrent requests—though it still happens occasionally even at that level.

@AllenDou
Contributor Author

AllenDou commented Mar 18, 2026

> AssertionError: Expected a cached item for mm_hash='c2a2c5bacb156913fd1771ca3976dba30430241c4f71b4e998676b32f85eccf4'
>
> When I use 16 concurrent requests to call vLLM serving FireRedASR2S, I occasionally encounter the following issue: the vLLM health check passes, but requests hang and never receive a response.

Could you show me your vllm serve command, test script, and some audio files? Also the vLLM version or commit SHA.

@Virtuoso461

> Could you show me your vllm serve command, test script, and some audio files? Also the vLLM version or commit SHA.

Thank you for your response. Below is the information you requested:

vLLM Serve Command
We are using the following command to serve the FireRedASR2-LLM model:

vllm serve /path/to/FireRedASR2-LLM-vllm -tp=4 --dtype=float32 --mm-processor-cache-gb 0
The key addition is --mm-processor-cache-gb 0, which disables the multi-modal processor cache. We observed a significant improvement in stability after adding this option: no more crashes or hangs even under high concurrency (tested with hundreds of thousands of inference requests).

vLLM Version
The version we are using is v0.17.1 (confirmed via pip show vllm).
(If you are using a different version, please note that the --mm-processor-cache-gb parameter may behave differently; we recommend checking your version’s documentation.)

Test Script & Audio Files
Unfortunately, due to company policy and data privacy concerns, we cannot share the test script or the audio files used in our experiments. However, we are happy to discuss the methodology or any other technical details that might help you reproduce the issue.

If you need further information or have additional questions, feel free to ask.

@AllenDou
Contributor Author

@Virtuoso461 What are the min/max/avg durations of your audio files? Is there anything special about them?

I installed vllm via pip install vllm==0.17.1 and ran:
vllm serve allendou/FireRedASR2-LLM-vllm -tp=2 --dtype=float32 (without --mm-processor-cache-gb 0), using 20 concurrent requests for 1 hour.
GPUs were 2×L20, the test dataset was LibriTTS (38,073 wav files), and everything worked well.
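For reference, the load test described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the script actually used: `transcribe` is a stand-in that in a real run would POST each wav file to the vLLM server's OpenAI-compatible transcription endpoint (e.g. via examples/online_serving/openai_transcription_client.py or the OpenAI SDK), and `run_load_test` is an assumed helper name.

```python
# Hypothetical sketch of a fixed-concurrency ASR load test (20 workers).
from concurrent.futures import ThreadPoolExecutor, as_completed


def transcribe(wav_path: str) -> str:
    # Stub: in a real test, replace with an HTTP call to the vLLM server,
    # e.g. client.audio.transcriptions.create(model=..., file=open(wav_path, "rb"))
    return f"transcript-of-{wav_path}"


def run_load_test(wav_paths, concurrency=20):
    """Issue transcription requests at a fixed concurrency level and
    collect per-file results, mirroring the 20-concurrent-request test."""
    results = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(transcribe, p): p for p in wav_paths}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results


results = run_load_test([f"clip_{i}.wav" for i in range(100)], concurrency=20)
print(len(results))  # 100
```

A real run against the LibriTTS set would simply feed the 38,073 wav paths in place of the synthetic names.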

@Virtuoso461

Both extremely long and extremely short utterances—in either Chinese or English—fail to be recognized correctly.

@Virtuoso461

> What are the min/max/avg durations of your audio files? Is there anything special about them?
>
> I installed vllm via pip install vllm==0.17.1 and ran vllm serve allendou/FireRedASR2-LLM-vllm -tp=2 --dtype=float32 (without --mm-processor-cache-gb 0), using 20 concurrent requests for 1 hour. GPUs were 2×L20, the test dataset was LibriTTS (38,073 wav files), and everything worked well.

Just to add: among the hundreds of thousands of samples, there are a few hundred such problematic ones. When vLLM encounters them, it doesn't report any error but simply hangs. The only way I can avoid this is by setting a timeout. However, even after the timeout, if I subsequently feed normal-length audio, vLLM still does not respond; only the KV cache keeps increasing, and no results are produced. The service only resumes after a restart.

@AllenDou
Contributor Author

AllenDou commented Mar 18, 2026

> Just to add: among the hundreds of thousands of samples, there are a few hundred such problematic ones. When vLLM encounters them, it doesn't report any error but simply hangs. The service only resumes after a restart.

Could you share at least one problematic audio file? It's hard to reproduce the problem right now. How long are the long ones, and how short are the short ones?

wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Virtuoso461

> Could you share at least one problematic audio file? It's hard to reproduce the problem right now. How long are the long ones, and how short are the short ones?

It's not convenient to share; the company's data is not allowed to be distributed externally. What I can tell you is that once an utterance gets long, e.g. the file size exceeds 50 MB, it triggers vLLM's multimodal exception (there is a 50 MB limit). I haven't yet figured out where the exception originates; it may be the request-header issue you mentioned.

@Virtuoso461

> Could you share at least one problematic audio file? It's hard to reproduce the problem right now. How long are the long ones, and how short are the short ones?

The data is roughly a bunch of filler speech like "uh" and "um". I don't yet know the cause of the crash, only that the crash probability is high. If your dataset is diverse enough, problems tend to appear with very large inputs and with overly repetitive, overly short outputs; there are also serious problems under concurrency.

@AllenDou
Contributor Author

@Virtuoso461 A 50 MB audio file is too large. I think you should preprocess the audio with a VAD model, which splits the recording into multiple chunks, and then feed the smaller chunks into the model.
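The pre-chunking idea above can be sketched with a toy energy-based splitter. A real pipeline would use a proper VAD model (e.g. silero-vad); this function, its name, and the thresholds are all illustrative assumptions, not anything from the PR.

```python
# Toy sketch: split a long recording on low-energy (silent) regions before
# sending the resulting chunks to the ASR server. Illustrative only; a real
# VAD model should be used in production.
def split_on_silence(samples, frame_size=160, threshold=100.0, min_frames=5):
    """Split a sequence of int16 samples into voiced chunks.

    A frame is treated as 'silent' when its mean absolute amplitude falls
    below `threshold`; a run of >= min_frames silent frames closes the
    current chunk. Only voiced frames are kept in the output chunks.
    """
    chunks, current, silent_run = [], [], 0
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy < threshold:
            silent_run += 1
            if silent_run >= min_frames and current:
                chunks.append(current)
                current = []
        else:
            silent_run = 0
            current.extend(frame)
    if current:
        chunks.append(current)  # flush the trailing voiced region
    return chunks


voiced = [1000] * 1600   # 10 "loud" frames
silence = [0] * 1600     # 10 silent frames
chunks = split_on_silence(voiced + silence + voiced)
print(len(chunks))  # 2
```

Each chunk would then be written out as its own wav file and transcribed independently, keeping every request well under the 50 MB limit.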

@AllenDou
Contributor Author

@Virtuoso461 If you have any data privacy concerns, feel free to contact me directly via WeChat/phone at 18621277886.


Labels

ci/build documentation Improvements or additions to documentation new-model Requests to new models ready ONLY add when PR is ready to merge/full CI is needed
