
[model] support FireRedASR2#35727

Merged
vllm-bot merged 18 commits into vllm-project:main from AllenDou:fireredasr2
Mar 4, 2026

Conversation

@AllenDou
Contributor

@AllenDou AllenDou commented Mar 2, 2026

Hi @DarkLight1337, this PR adds support for the FireRedASR2 model (https://github.com/FireRedTeam/FireRedASR2S). Could you please take a look?

server:
vllm serve allendou/FireRedASR2-LLM-vllm -tp=1 --dtype=float32 (also supports -tp=2 --dtype=bfloat16)

client:
python3 examples/online_serving/openai_transcription_client.py --repetition_penalty=1.0 --audio_path=/root/hello_zh.wav

result:

transcription result [sync]: 你好世界

transcription result [stream]: 你好世界
[Stream finished reason: stop]

By the way, FireRedASR2 uses a rather unusual encoder multi-head attention class, RelPosMultiHeadAttention(EncoderMultiHeadAttention). I still haven't found a good way to replace it with MMEncoderAttention.

Also, users can purchase the FireRedASR2 service from Alibaba PAI: https://pai.console.aliyun.com/?regionId=cn-hangzhou#/quick-start/models/FireRedASR2-LLM-vllm/intro

@mergify
Contributor

mergify Bot commented Mar 2, 2026

Documentation preview: https://vllm--35727.org.readthedocs.build/en/35727/

@mergify mergify Bot added documentation Improvements or additions to documentation new-model Requests to new models labels Mar 2, 2026
Contributor

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request adds support for the FireRedASR2 model. The implementation looks mostly good, but there are a few critical issues that need to be addressed.

  • There is a hardcoded .cuda() call which will break device-agnostic execution.
  • The prompt for transcription is hardcoded to Chinese, which prevents using the model for other languages and ignores user-provided prompts.
  • There is a bug in the processor that breaks batch processing for multiple audio files.

I've left specific comments with suggestions for fixes.

Comment thread vllm/model_executor/models/fireredasr2.py Outdated
Comment thread vllm/transformers_utils/processors/fireredasr2_processor.py
Comment thread vllm/model_executor/models/fireredasr2.py
@DarkLight1337 DarkLight1337 requested a review from Isotr0py March 2, 2026 09:46
Comment thread vllm/model_executor/models/fireredasr2.py
Comment thread tests/models/registry.py Outdated
Comment thread docs/models/supported_models.md Outdated
Comment thread vllm/model_executor/models/fireredasr2.py Outdated
Comment thread vllm/model_executor/models/fireredasr2.py Outdated
Comment thread vllm/model_executor/models/fireredasr2.py Outdated
Comment thread vllm/model_executor/models/fireredasr2.py Outdated
@AllenDou AllenDou requested a review from Isotr0py March 3, 2026 06:21
@Isotr0py Isotr0py enabled auto-merge (squash) March 3, 2026 09:45
@github-actions github-actions Bot added the ready ONLY add when PR is ready to merge/full CI is needed label Mar 3, 2026
@AllenDou
Contributor Author

AllenDou commented Mar 3, 2026

@Isotr0py Many CI checks have failed. Should I rebase onto main?
Also, I hit a weird error: "Unable to rebase: Mergify can't impersonate AllenDou".

@DarkLight1337
Member

I think you need to install kaldi_native_fbank for the test

auto-merge was automatically disabled March 3, 2026 11:16

Head branch was pushed to by a user without write access

@mergify mergify Bot added the ci/build label Mar 3, 2026
zixiao and others added 14 commits March 3, 2026 19:18
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
zixiao added 2 commits March 3, 2026 19:46
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
@AllenDou
Contributor Author

AllenDou commented Mar 4, 2026

The CI check amd-v1-others-mi325-1 is still failing. How should I handle it? @Isotr0py @DarkLight1337

@vllm-bot vllm-bot merged commit c1d9634 into vllm-project:main Mar 4, 2026
114 of 116 checks passed
@AllenDou
Contributor Author

AllenDou commented Mar 4, 2026

PR merged, thank you!

@DarkLight1337 DarkLight1337 mentioned this pull request Mar 9, 2026
Copilot AI pushed a commit to machov/vllm that referenced this pull request Mar 10, 2026
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
avinashsingh77 pushed a commit to avinashsingh77/vllm that referenced this pull request Mar 12, 2026
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Virtuoso461

(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] Unexpected error pre-processing request transcribe-b7ab8af624eb4608_1-88994c71
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] Traceback (most recent call last):
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] File "/root/miniforge3/envs/qiu/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 1333, in process_input_sockets
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] request = self.preprocess_add_request(req)
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] File "/root/miniforge3/envs/qiu/lib/python3.12/site-packages/vllm/v1/engine/core.py", line 745, in preprocess_add_request
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] request.mm_features = self.mm_receiver_cache.get_and_update_features(
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] File "/root/miniforge3/envs/qiu/lib/python3.12/site-packages/vllm/multimodal/cache.py", line 591, in get_and_update_features
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] feature.data = self.get_and_update_item(feature.data, cache_key)
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] File "/root/miniforge3/envs/qiu/lib/python3.12/site-packages/vllm/multimodal/cache.py", line 644, in get_and_update_item
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] assert mm_item is not None, f"Expected a cached item for {mm_hash=}"
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] ^^^^^^^^^^^^^^^^^^^
(EngineCore_DP0 pid=51637) ERROR 03-18 16:48:41 [core.py:1424] AssertionError: Expected a cached item for mm_hash='c2a2c5bacb156913fd1771ca3976dba30430241c4f71b4e998676b32f85eccf4'

When I use 16 concurrent requests to call vLLM serving FireRedASR2S, I occasionally encounter the following issue: the vLLM health check passes, but requests hang and never receive a response. My setup uses four A100 80GB GPUs. This hanging issue occurs more frequently when the concurrency is set to 12 or above 20, and it becomes relatively less frequent at 16 concurrent requests—though it still happens occasionally even at that level.

@AllenDou
Contributor Author

AllenDou commented Mar 18, 2026

> AssertionError: Expected a cached item for mm_hash='c2a2c5bacb156913fd1771ca3976dba30430241c4f71b4e998676b32f85eccf4'
>
> When I use 16 concurrent requests to call vLLM serving FireRedASR2S, I occasionally encounter the following issue: the vLLM health check passes, but requests hang and never receive a response.

Could you show me your vllm serve command, test script, and some audio files? Also the vLLM version or commit SHA.

@Virtuoso461

> Could you show me your vllm serve command, test script, and some audio files? Also the vLLM version or commit SHA.

Thank you for your response. Below is the information you requested:

vLLM Serve Command
We are using the following command to serve the FireRedASR2-LLM model:

vllm serve /path/to/FireRedASR2-LLM-vllm -tp=4 --dtype=float32 --mm-processor-cache-gb 0
The key addition is --mm-processor-cache-gb 0, which disables the multi-modal processor cache. We observed a significant improvement in stability after adding this option: no more crashes or hangs even under high concurrency (tested with hundreds of thousands of inference requests).

vLLM Version
The version we are using is v0.17.1 (confirmed via pip show vllm).
(If you are using a different version, please note that the --mm-processor-cache-gb parameter may behave differently; we recommend checking your version’s documentation.)

Test Script & Audio Files
Unfortunately, due to company policy and data privacy concerns, we cannot share the test script or the audio files used in our experiments. However, we are happy to discuss the methodology or any other technical details that might help you reproduce the issue.

If you need further information or have additional questions, feel free to ask.

@AllenDou
Contributor Author

@Virtuoso461 What are the min/max/avg durations of your audio files? Is there anything special about them?

I installed vllm via pip install vllm==0.17.1 and ran:
vllm serve allendou/FireRedASR2-LLM-vllm -tp=2 --dtype=float32 (without --mm-processor-cache-gb 0), using 20 concurrent requests for 1 hour.
GPUs were 2×L20, the test dataset was LibriTTS (38,073 wav files), and everything worked well.
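For reference, the load test described above can be sketched roughly as follows. This is a hypothetical reconstruction, not the script actually used: `transcribe` is a stand-in that in a real run would POST each wav file to the vLLM server's OpenAI-compatible transcription endpoint (e.g. via examples/online_serving/openai_transcription_client.py or the OpenAI SDK), and `run_load_test` is an assumed helper name.

```python
# Hypothetical sketch of a fixed-concurrency ASR load test (20 workers).
from concurrent.futures import ThreadPoolExecutor, as_completed


def transcribe(wav_path: str) -> str:
    # Stub: in a real test, replace with an HTTP call to the vLLM server,
    # e.g. client.audio.transcriptions.create(model=..., file=open(wav_path, "rb"))
    return f"transcript-of-{wav_path}"


def run_load_test(wav_paths, concurrency=20):
    """Issue transcription requests at a fixed concurrency level and
    collect per-file results, mirroring the 20-concurrent-request test."""
    results = {}
    with ThreadPoolExecutor(max_workers=concurrency) as pool:
        futures = {pool.submit(transcribe, p): p for p in wav_paths}
        for fut in as_completed(futures):
            results[futures[fut]] = fut.result()
    return results


results = run_load_test([f"clip_{i}.wav" for i in range(100)], concurrency=20)
print(len(results))  # 100
```

A real run against the LibriTTS set would simply feed the 38,073 wav paths in place of the synthetic names.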

@Virtuoso461

Both extremely long and extremely short utterances—in either Chinese or English—fail to be recognized correctly.

@Virtuoso461

> What are the min/max/avg durations of your audio files? Is there anything special about them?
>
> I installed vllm via pip install vllm==0.17.1 and ran vllm serve allendou/FireRedASR2-LLM-vllm -tp=2 --dtype=float32 (without --mm-processor-cache-gb 0), using 20 concurrent requests for 1 hour. GPUs were 2×L20, the test dataset was LibriTTS (38,073 wav files), and everything worked well.

Just to add: among the hundreds of thousands of samples, there are a few hundred such problematic ones. When vLLM encounters them, it doesn't report any error but simply hangs. The only way I can avoid this is by setting a timeout. However, even after the timeout, if I subsequently feed normal-length audio, vLLM still does not respond; only the KV cache keeps increasing, and no results are produced. The service only resumes after a restart.

@AllenDou
Contributor Author

AllenDou commented Mar 18, 2026

> Just to add: among the hundreds of thousands of samples, there are a few hundred such problematic ones. When vLLM encounters them, it doesn't report any error but simply hangs. The service only resumes after a restart.

Could you share at least one problematic audio file? It's hard to reproduce the problem right now. How long are the long ones, and how short are the short ones?

wendyliu235 pushed a commit to wendyliu235/vllm-public that referenced this pull request Mar 18, 2026
Signed-off-by: zixiao <shunli.dsl@alibaba-inc.com>
Signed-off-by: Isotr0py <mozf@mail2.sysu.edu.cn>
Co-authored-by: zixiao <shunli.dsl@alibaba-inc.com>
Co-authored-by: Isotr0py <mozf@mail2.sysu.edu.cn>
@Virtuoso461

> Could you share at least one problematic audio file? It's hard to reproduce the problem right now. How long are the long ones, and how short are the short ones?

It's not convenient to share; the company's data is not allowed to be distributed externally. What I can tell you is that once an utterance gets long, e.g. the file size exceeds 50 MB, it triggers vLLM's multimodal exception (there is a 50 MB limit). I haven't yet figured out where the exception originates; it may be the request-header issue you mentioned.

@Virtuoso461

> Could you share at least one problematic audio file? It's hard to reproduce the problem right now. How long are the long ones, and how short are the short ones?

The data is roughly a bunch of filler speech like "uh" and "um". I don't yet know the cause of the crash, only that the crash probability is high. If your dataset is diverse enough, problems tend to appear with very large inputs and with overly repetitive, overly short outputs; there are also serious problems under concurrency.

@AllenDou
Contributor Author

@Virtuoso461 A 50 MB audio file is too large. I think you should preprocess the audio with a VAD model, which splits the recording into multiple chunks, and then feed the smaller chunks into the model.
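The pre-chunking idea above can be sketched with a toy energy-based splitter. A real pipeline would use a proper VAD model (e.g. silero-vad); this function, its name, and the thresholds are all illustrative assumptions, not anything from the PR.

```python
# Toy sketch: split a long recording on low-energy (silent) regions before
# sending the resulting chunks to the ASR server. Illustrative only; a real
# VAD model should be used in production.
def split_on_silence(samples, frame_size=160, threshold=100.0, min_frames=5):
    """Split a sequence of int16 samples into voiced chunks.

    A frame is treated as 'silent' when its mean absolute amplitude falls
    below `threshold`; a run of >= min_frames silent frames closes the
    current chunk. Only voiced frames are kept in the output chunks.
    """
    chunks, current, silent_run = [], [], 0
    for i in range(0, len(samples), frame_size):
        frame = samples[i:i + frame_size]
        energy = sum(abs(s) for s in frame) / max(len(frame), 1)
        if energy < threshold:
            silent_run += 1
            if silent_run >= min_frames and current:
                chunks.append(current)
                current = []
        else:
            silent_run = 0
            current.extend(frame)
    if current:
        chunks.append(current)  # flush the trailing voiced region
    return chunks


voiced = [1000] * 1600   # 10 "loud" frames
silence = [0] * 1600     # 10 silent frames
chunks = split_on_silence(voiced + silence + voiced)
print(len(chunks))  # 2
```

Each chunk would then be written out as its own wav file and transcribed independently, keeping every request well under the 50 MB limit.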

@AllenDou
Contributor Author

@Virtuoso461 If you have any data privacy concerns, feel free to contact me directly via WeChat/phone at 18621277886.


Labels

ci/build documentation Improvements or additions to documentation new-model Requests to new models ready ONLY add when PR is ready to merge/full CI is needed
