
Commit 306c935

update asr(en)
1 parent a11f03f commit 306c935

File tree: 1 file changed (+147, -36 lines)

  • i18n/en/docusaurus-plugin-content-docs/current/user-guide/backend

i18n/en/docusaurus-plugin-content-docs/current/user-guide/backend/asr.md

# Speech Recognition (ASR)
Speech Recognition (ASR, Automatic Speech Recognition) converts user speech to text. This project supports multiple speech recognition model implementations.

ASR-related configuration items are under `asr_config` in `conf.yaml`.

Here are the speech recognition options you can choose from:

## `sherpa_onnx_asr` (Local & Project Default)

:::note
(Added in `v0.5.0-alpha.1` PR: [Add sherpa-onnx support #50](https://github.com/t41372/Open-LLM-VTuber/pull/50))
:::

[sherpa-onnx](https://github.com/k2-fsa/sherpa-onnx) is a feature-rich inference tool that can run various speech recognition (ASR) models.

:::info
Starting from version `v1.0.0`, this project uses `sherpa-onnx` to run the `SenseVoiceSmall` (int8 quantized) model as the default speech recognition solution. It works out of the box: no additional setup is needed, and the system automatically downloads and extracts the model files to the project's `models` directory on first run.
:::

### Recommended Users
- All users (hence it's the default)
- Especially Mac users (due to limited options)
- Non-NVIDIA GPU users
- Chinese users
- Fast CPU inference
- Configuration difficulty: none; it's the project default, so no setup is needed

The SenseVoiceSmall model may have average English performance.

### CUDA Inference

`sherpa-onnx` supports both CPU and CUDA inference. While the default `SenseVoiceSmall` model performs well on CPU, if you have an NVIDIA GPU you can enable CUDA inference for better performance by following these steps:

1. First, uninstall the CPU version dependencies:

   ```sh
   uv remove sherpa-onnx onnxruntime
   # Avoid pulling onnxruntime back in through other dependencies
   uv remove faster-whisper
   ```

   > Note that in this example `sherpa-onnx` is installed from pre-built wheels, which means you need to install CUDA Toolkit 11.x plus cuDNN 8.x for CUDA 11.x, and add `%SystemDrive%\Program Files\NVIDIA\CUDNN\v8.x\bin` to your `PATH` (where `8.x` is your cuDNN minor version, e.g. `v8.9` for version `v8.9.7`), so that the correct CUDA environment is linked.
   >
   > If you don't want to use the official NVIDIA installer or set `PATH` manually, consider using [`pixi`](https://pixi.sh/) to manage a local conda environment. With this approach you don't need to install the dependencies via uv:
   >
   > ```nushell
   > pixi remove --pypi onnxruntime sherpa-onnx
   > pixi add --pypi onnxruntime-gpu==1.17.1 pip
   > pixi run python -m pip install sherpa-onnx==1.10.39+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html
   > ```

2. Install the CUDA versions of the `sherpa-onnx` and `onnxruntime-gpu` dependencies:

   ```sh
   # The pre-built sherpa-onnx wheels are compatible with onnxruntime-gpu==1.17.1
   uv add onnxruntime-gpu==1.17.1 sherpa-onnx==1.10.39+cuda -f https://k2-fsa.github.io/sherpa/onnx/cuda.html
   ```

3. Modify the configuration file: in `conf.yaml`, find the `sherpa_onnx_asr` section and set `provider` to `cuda` (see the sketch below).

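A minimal sketch of that change; only the `provider` line is the point here, so keep the other fields exactly as they already appear in your `conf.yaml`:

```yaml
sherpa_onnx_asr:
  # ...keep your existing model settings unchanged...
  provider: 'cuda' # switch from 'cpu' to 'cuda' for NVIDIA GPU inference
```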

### Using Other sherpa-onnx Models

If you want to try other speech recognition models:
1. Download the desired model from [sherpa-onnx ASR models](https://github.com/k2-fsa/sherpa-onnx/releases/tag/asr-models)
2. Place the model files in the project's `models` directory
3. Modify the relevant configuration of `sherpa_onnx_asr` according to the instructions in `conf.yaml` (an illustrative sketch follows below)

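For illustration only; the exact keys for each model family are documented in the comments of `conf.yaml`, so treat the key names below as assumptions:

```yaml
sherpa_onnx_asr:
  model_type: 'paraformer'                        # assumed key: the family of the model you downloaded
  paraformer: 'models/paraformer/model.int8.onnx' # assumed key: path to the downloaded model file
  tokens: 'models/paraformer/tokens.txt'          # assumed key: tokens file shipped with the model
  provider: 'cpu'                                 # or 'cuda' (see above)
```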

## `fun_asr` (Local)

[FunASR](https://github.com/modelscope/FunASR?tab=readme-ov-file) is a fundamental end-to-end speech recognition toolkit from ModelScope that supports various ASR models. Among them, Alibaba's [FunAudioLLM](https://github.com/FunAudioLLM/SenseVoice) SenseVoiceSmall model performs well in both accuracy and speed.

:::tip
Although FunASR can run the SenseVoiceSmall model, we recommend using the project's default `sherpa_onnx_asr`. The FunASR project has some stability issues and may encounter exceptions on certain devices.

However, FunASR utilizes the GPU better, so it might be faster for NVIDIA GPU users.
:::

### Recommended Users
- Users with NVIDIA GPUs who want to run the SenseVoiceSmall model with GPU inference
- Chinese users
- Fast CPU inference
- Configuration difficulty: Simple

SenseVoiceSmall may have average English performance.

### Installation

In the project directory, run:

```sh
uv add funasr modelscope huggingface_hub onnxconverter_common torch torchaudio onnx
```

:::info Dependency Issue Solutions
If you encounter a dependency conflict like the following:

```sh
help: `llvmlite` (v0.36.0) was included because `open-llm-vtuber` (v1.0.0a1) depends on `funasr` (v1.2.2) which depends on `umap-learn` (v0.5.7)
```

you can install the dependencies with `uv pip install` instead:

```sh
uv pip install funasr modelscope huggingface_hub torch torchaudio onnx onnxconverter_common
```
:::

:::warning
Even if the model files are already local, an internet connection is still required at startup.

Solution: specify the local path of the model directly in the configuration, so no internet connection is needed at runtime. However, you need to download the model files in advance. See [FunASR Issue #1897](https://github.com/modelscope/FunASR/issues/1897) for details, and the sketch below.
:::

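A hedged sketch of such a configuration; the real key names are documented in the `fun_asr` section of `conf.yaml`, so treat these as assumptions:

```yaml
fun_asr:
  model_name: 'models/SenseVoiceSmall' # assumed key: a local model directory downloaded in advance
  device: 'cuda'                       # assumed key: or 'cpu'
```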

## `faster_whisper` (Local)
- [Official Repository](https://github.com/SYSTRAN/faster-whisper)

This is an optimized Whisper inference engine that can run original Whisper and distilled Whisper models. It provides faster inference than the original Whisper but cannot automatically detect language.

:::info
Faster Whisper [does not support Mac GPU inference](https://github.com/SYSTRAN/faster-whisper/issues/911) and can only run on the CPU there, with average performance. For optimal performance, use it on a device equipped with an NVIDIA GPU.
:::

### Recommended Users
- Users with NVIDIA GPUs who want to run Whisper models with GPU inference
- Non-Chinese users; the Whisper series has good multilingual support
- CPU inference is relatively slow
- Configuration difficulty: Simple

### Installation and Configuration

If you want to use GPU acceleration (NVIDIA GPU users only), you need to install the following NVIDIA libraries. For detailed installation steps, please refer to [Quick Start](/docs/quick-start.md):
- [cuBLAS for CUDA 12](https://developer.nvidia.com/cublas)
- [cuDNN 8 for CUDA 12](https://developer.nvidia.com/cudnn)

If you don't care much about running speed or have a powerful CPU, you can also set the `device` parameter of `faster_whisper` to `cpu` in the `conf.yaml` configuration file. This avoids the hassle of installing the NVIDIA dependency libraries.

```yaml
# Faster Whisper configuration
faster_whisper:
  model_path: 'large-v3-turbo'    # Model path, model name, or HF Hub model id
  download_root: 'models/whisper' # Root directory for model downloads
  language: 'zh'                  # Language: en, zh, or others. Leave empty for auto-detection
  device: 'auto'                  # Device: cpu, cuda, or auto. faster-whisper doesn't support mps
  compute_type: 'int8'
```

### Model Selection (model_path)

`model_path` can be a model name, the local path of a model (if you downloaded it in advance), or a Hugging Face model id (it must be a model already converted to CTranslate2 format).

**Available model names:**

`tiny`, `tiny.en`, `base`, `base.en`, `small`, `small.en`, `distil-small.en`, `medium`, `medium.en`, `distil-medium.en`, `large-v1`, `large-v2`, `large-v3`, `large`, `distil-large-v2`, `distil-large-v3`, `large-v3-turbo`, `turbo`

The distil series models may only support English.

The selected model will be automatically downloaded from Hugging Face to the `models/whisper` folder in the project directory.

Test results on a 4060 (thanks to Lena from the QQ group for providing the results in [#187](https://github.com/Open-LLM-VTuber/Open-LLM-VTuber/issues/187#issuecomment-2814846254) and [#188](https://github.com/Open-LLM-VTuber/Open-LLM-VTuber/pull/188)):

> Using 22 seconds of generated audio, tested with int8 on a 13th-gen i5 and a 4060 8GB, CUDA 12.8, cuDNN 9.8:
> - CPU: v3-turbo took 5.98 seconds, small took 1.56 seconds
> - GPU: v3-turbo took 1.04 seconds, small took 0.48 seconds
>
> Summary:
> - Without a 4060, choose small: medium and v3-turbo are similar in size, so on 20/30-series cards small is likely the best recognition quality while still keeping the speed up.
> - With a 4060, choose v3-turbo: if speed is not an issue, higher accuracy is naturally better.
> - Accuracy reference: faster-whisper-small has 244M parameters, faster-whisper-v3-turbo has 809M.

Test results on a MacBook Pro M1 Pro:
> Don't even try; it's very slow. Using whisper.cpp with CoreML acceleration or the SenseVoiceSmall model would be much faster.

**Hugging Face model id format**
```
"username/whisper-large-v3-ct2"
```
Note that faster-whisper requires models already converted to CTranslate2 format.
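For example, a hedged sketch using a repo id (the id below is illustrative; any Whisper repo on the Hub already converted to CTranslate2 format should work the same way):

```yaml
faster_whisper:
  model_path: 'Systran/faster-whisper-large-v3' # illustrative CTranslate2-format repo id
  download_root: 'models/whisper'
  language: ''      # leave empty for auto-detection
  device: 'auto'
  compute_type: 'int8'
```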
## `whisper_cpp` (Local)
- `whisper_cpp` can be accelerated through CoreML on macOS for faster inference
- When running on CPU or NVIDIA GPU, performance may not be as good as Faster-Whisper
- Mac users: please refer to the instructions below to configure WhisperCPP with CoreML support; if you need to use CPU or NVIDIA GPU, just run `pip install pywhispercpp` to install it

### Recommended Users
- Mac users who want to run Whisper-series models with GPU inference
- Chinese users
- CPU inference is relatively slow; a GPU is needed
- Configuration difficulty: Setting up GPU acceleration might be a bit challenging

### Installation

For example, to build `pywhispercpp` with Vulkan acceleration:

```sh
GGML_VULKAN=1 pip install git+https://github.com/absadiki/pywhispercpp
```
### CoreML Configuration
- Method 1: Follow the Whisper.cpp repository documentation to convert Whisper models to CoreML format
- Method 2: Download pre-converted CoreML models from the [Hugging Face repository](https://huggingface.co/chidiwilliams/whisper.cpp-coreml/tree/main). Note: after downloading, you need to extract the model files, otherwise the program cannot load them and will crash.
- Configuration note: When configuring models in `conf.yaml`, you don't need to include the special prefix in the filename. For example, when the CoreML model filename is `ggml-base-encoder.mlmodelc`, you only need to fill in `base` for the `model_name` parameter of `WhisperCPP` (see the sketch below).

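A hedged sketch of that naming rule; only the `model_name` behavior is described above, so anything else in the section is an assumption:

```yaml
whisper_cpp:
  # For the CoreML file ggml-base-encoder.mlmodelc, drop the affixes and write only 'base'
  model_name: 'base'
```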
## `whisper` (Local)

OpenAI's original Whisper. Install with `uv pip install -U openai-whisper`. Inference is very slow.

### Recommended Users
- Not recommended
## `groq_whisper_asr` (Online, requires API key, but easy to register with a generous free quota)

Groq's Whisper endpoint is very accurate (it supports multiple languages) and fast, with many free uses per day. It's pre-installed. Get an API key from [groq](https://console.groq.com/keys) and add it to the `groq_whisper_asr` settings in `conf.yaml` (see the sketch after the list below). Users in mainland China and other unsupported regions need a proxy to use it (the Hong Kong region may not be supported).

### Recommended Users
- Users who accept using online speech recognition
- Multilingual users
- No local computation, very fast (depends on your network speed)
- Configuration difficulty: Simple

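A minimal sketch of that entry; `api_key` is the field described above, while the `model` key and its value are assumptions, so follow the comments in your `conf.yaml`:

```yaml
groq_whisper_asr:
  api_key: 'gsk_...'        # key from https://console.groq.com/keys
  model: 'whisper-large-v3' # assumed key and value
```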

## `azure_asr` (Online, requires API key)

- Azure Speech Recognition
- Configure the API key and region under the `azure_asr` option (see the sketch below)

:::warning
`api_key.py` was deprecated after `v0.2.5`. Please set API keys in `conf.yaml`.
:::
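A hedged sketch of the entry (both key names are assumptions; check the comments in `conf.yaml`):

```yaml
azure_asr:
  api_key: 'your-azure-speech-key'
  region: 'eastus' # the region your Azure Speech resource was created in
```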

### Recommended Users
- People who have Azure API keys (Azure accounts are not easy to register)
- Multilingual users
- No local computation, very fast (depends on your network speed)
- Configuration difficulty: Simple
