A speech recognition API service powered by FunASR and Qwen3-ASR, supporting 52 languages and compatible with the OpenAI API and the Alibaba Cloud Speech API.
- Web Demo: https://asr.vect.one
- Multi-Model Support - Integrates Qwen3-ASR 1.7B/0.6B and Paraformer Large ASR models
- Speaker Diarization - Automatic multi-speaker identification using CAM++ model
- OpenAI API Compatible - Supports the `/v1/audio/transcriptions` endpoint, works with the OpenAI SDK
- Alibaba Cloud API Compatible - Supports the Alibaba Cloud Speech RESTful API and WebSocket streaming protocol
- WebSocket Streaming - Real-time streaming speech recognition with low latency
- Smart Far-Field Filtering - Automatically filters far-field sounds and ambient noise in streaming ASR
- Intelligent Audio Segmentation - VAD-based greedy merge algorithm for automatic long audio splitting
- GPU Batch Processing - Batch inference support, 2-3x faster than sequential processing
- Flexible Configuration - Environment variable based configuration, load models on demand
```bash
# Copy and edit configuration
cp .env.example .env
# Edit .env to set ENABLED_MODELS and API_KEY (optional)

# Start service (GPU version)
docker-compose up -d

# Or CPU version
docker-compose -f docker-compose-cpu.yml up -d

# Multi-GPU auto mode (one instance per visible GPU)
CUDA_VISIBLE_DEVICES=0,1,2,3 docker-compose up -d
```

Service URLs:
- API Endpoint: http://localhost:17003
- API Docs: http://localhost:17003/docs
Optional built-in rate limit settings:
- `NGINX_RATE_LIMIT_RPS` - global requests/sec (`0` = disabled)
- `NGINX_RATE_LIMIT_BURST` - global burst (`0` = auto, uses the RPS value)
docker run (alternative):

```bash
# GPU version
docker run -d --name funasr-api \
  --gpus all \
  -p 17003:8000 \
  -e ENABLED_MODELS=auto \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -e API_KEY=your_api_key \
  -v ./models/modelscope:/root/.cache/modelscope \
  -v ./models/huggingface:/root/.cache/huggingface \
  quantatrisk/funasr-api:gpu-latest

# CPU version
docker run -d --name funasr-api \
  -p 17003:8000 \
  -e ENABLED_MODELS=paraformer-large \
  quantatrisk/funasr-api:cpu-latest
```

Note: the CPU environment automatically filters out the Qwen3 models (vLLM requires a GPU).
Offline Deployment: Use the helper script to prepare the models, then copy them to the offline machine:

```bash
# 1. Prepare models (interactive selection)
./scripts/prepare-models.sh

# 2. Copy the package to the offline server
scp funasr-models-*.tar.gz user@server:/opt/funasr-api/

# 3. On the offline server, extract and start
tar -xzvf funasr-models-*.tar.gz
docker-compose up -d
```

See MODEL_SETUP.md for more details.
Detailed deployment instructions: Deployment Guide
System Requirements:
- Python 3.10+
- CUDA 12.6+ (optional, for GPU acceleration)
- FFmpeg (audio format conversion)
Installation:

```bash
# Clone project
cd FunASR-API

# Install dependencies
pip install -r requirements.txt

# Start service
python start.py
```

| Endpoint | Method | Function |
|---|---|---|
| `/v1/audio/transcriptions` | POST | Audio transcription (OpenAI compatible) |
| `/v1/models` | GET | Model list |
Request Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `file` | file | Mutually exclusive with `audio_address` | Audio file |
| `audio_address` | string | Mutually exclusive with `file` | Audio file URL (HTTP/HTTPS) |
| `model` | string | auto-detect | Model selection (`qwen3-asr-1.7b`, `qwen3-asr-0.6b`, `paraformer-large`) |
| `language` | string | auto-detect | Language code (zh/en/ja) |
| `enable_speaker_diarization` | bool | `true` | Enable speaker diarization |
| `word_timestamps` | bool | `true` | Return word-level timestamps (Qwen3-ASR only) |
| `response_format` | string | `verbose_json` | Output format |
| `prompt` | string | - | Prompt text (reserved) |
| `temperature` | float | `0` | Sampling temperature (reserved) |
Audio Input Methods:
- File Upload: Use the `file` parameter to upload an audio file (standard OpenAI way)
- URL Download: Use the `audio_address` parameter to provide an audio URL; the service downloads it automatically
Usage Examples:

```python
# Using OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your_api_key")

with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # Maps to default model
        file=f,
        response_format="verbose_json"  # Get segments and speaker info
    )
print(transcript.text)
```

```bash
# Using curl
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
  -H "Authorization: Bearer your_api_key" \
  -F "file=@audio.wav" \
  -F "model=paraformer-large" \
  -F "response_format=verbose_json" \
  -F "enable_speaker_diarization=true"
```

Supported Response Formats: `json`, `text`, `srt`, `vtt`, `verbose_json`
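Beyond file upload, the `audio_address` parameter documented above lets the service fetch the audio itself. A minimal sketch, assuming the endpoint accepts these fields as ordinary form data; the URL and API key below are placeholders:

```python
# Sketch: transcribe a remote file via the audio_address form field instead of
# uploading. Field names come from the parameter table above; the example URL
# and API key are placeholders, not real values.

def build_form(audio_url, model="paraformer-large", diarize=True):
    """Assemble the form fields for /v1/audio/transcriptions."""
    return {
        "audio_address": audio_url,
        "model": model,
        "response_format": "verbose_json",
        "enable_speaker_diarization": "true" if diarize else "false",
    }

if __name__ == "__main__":
    import requests  # third-party: pip install requests

    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        headers={"Authorization": "Bearer your_api_key"},
        data=build_form("https://example.com/audio.wav"),
    )
    print(resp.json()["text"])
```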
| Endpoint | Method | Function |
|---|---|---|
| `/stream/v1/asr` | POST | Speech recognition (long audio support) |
| `/stream/v1/asr/models` | GET | Model list |
| `/stream/v1/asr/health` | GET | Health check |
| `/ws/v1/asr` | WebSocket | Streaming ASR (Alibaba Cloud protocol compatible) |
| `/ws/v1/asr/funasr` | WebSocket | FunASR streaming (backward compatible) |
| `/ws/v1/asr/qwen` | WebSocket | Qwen3-ASR streaming |
| `/ws/v1/asr/test` | GET | WebSocket test page |
Request Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_id` | string | auto-detect | Model ID |
| `audio_address` | string | - | Audio URL (optional) |
| `sample_rate` | int | `16000` | Sample rate |
| `enable_speaker_diarization` | bool | `true` | Enable speaker diarization |
| `word_timestamps` | bool | `true` | Return word-level timestamps (Qwen3-ASR only) |
| `vocabulary_id` | string | - | Hotwords (format: `word1 weight1 word2 weight2`) |
Usage Examples:

```bash
# Basic usage
curl -X POST "http://localhost:8000/stream/v1/asr" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @audio.wav

# With parameters
curl -X POST "http://localhost:8000/stream/v1/asr?enable_speaker_diarization=true" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @audio.wav
```

Response Example:

```json
{
  "task_id": "xxx",
  "status": 200,
  "message": "SUCCESS",
  "result": "Speaker1 content...\nSpeaker2 content...",
  "duration": 60.5,
  "processing_time": 1.234,
  "segments": [
    {
      "text": "Today is a nice day.",
      "start_time": 0.0,
      "end_time": 2.5,
      "speaker_id": "Speaker1",
      "word_tokens": [
        {"text": "Today", "start_time": 0.0, "end_time": 0.5},
        {"text": "is", "start_time": 0.5, "end_time": 0.7},
        {"text": "a nice day", "start_time": 0.7, "end_time": 1.5}
      ]
    }
  ]
}
```

WebSocket Streaming Test: Visit http://localhost:8000/ws/v1/asr/test
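The `segments` array in the response can be post-processed client-side. A small illustrative helper, relying only on the documented `text` and `speaker_id` fields, that joins each speaker's text (the sample segments below are made up):

```python
# Group transcript text per speaker from the response's segments array.
# Uses only the documented "text" and "speaker_id" fields; the sample data
# is illustrative, not real service output.
from collections import defaultdict

def transcript_by_speaker(segments):
    """Return {speaker_id: joined text}, preserving segment order."""
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg.get("speaker_id", "unknown")].append(seg["text"])
    return {spk: " ".join(parts) for spk, parts in grouped.items()}

segments = [
    {"text": "Today is a nice day.", "speaker_id": "Speaker1"},
    {"text": "It certainly is.", "speaker_id": "Speaker2"},
    {"text": "Shall we go out?", "speaker_id": "Speaker1"},
]
print(transcript_by_speaker(segments))
# {'Speaker1': 'Today is a nice day. Shall we go out?', 'Speaker2': 'It certainly is.'}
```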
Multi-speaker automatic identification based on the CAM++ model:

- Enabled by Default - `enable_speaker_diarization=true`
- Automatic Detection - No preset speaker count needed; the model auto-detects
- Speaker Labels - Response includes a `speaker_id` field (e.g., "Speaker1", "Speaker2")
- Smart Merging - Two-layer merge strategy to avoid isolated short segments:
  - Layer 1: Accumulate-merge same-speaker segments shorter than 10 seconds
  - Layer 2: Accumulate-merge continuous segments up to 60 seconds
- Subtitle Support - SRT/VTT output includes speaker labels: `[Speaker1] text content`
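The two-layer strategy above can be sketched roughly as follows. The 10 s and 60 s thresholds are the documented values, but the merge loop itself is an illustrative reconstruction, not the service's actual implementation:

```python
# Illustrative sketch of the two-layer merge: absorb short same-speaker
# fragments (layer 1), then keep extending a same-speaker run up to the
# 60-second cap (layer 2). Not the service's actual code.

def merge_segments(segments, short_sec=10.0, max_sec=60.0):
    merged = []
    for seg in segments:
        if merged and merged[-1]["speaker_id"] == seg["speaker_id"]:
            prev = merged[-1]
            frag_len = seg["end_time"] - seg["start_time"]
            run_len = seg["end_time"] - prev["start_time"]
            # Layer 1: fragment under short_sec; Layer 2: run still under max_sec
            if frag_len < short_sec or run_len <= max_sec:
                prev["end_time"] = seg["end_time"]
                prev["text"] += " " + seg["text"]
                continue
        merged.append(dict(seg))
    return merged

segs = [
    {"speaker_id": "Speaker1", "start_time": 0.0, "end_time": 2.0, "text": "Hi."},
    {"speaker_id": "Speaker1", "start_time": 2.0, "end_time": 5.0, "text": "How are you?"},
    {"speaker_id": "Speaker2", "start_time": 5.0, "end_time": 8.0, "text": "Fine."},
]
print(merge_segments(segs))  # two segments: Speaker1 0-5 s, Speaker2 5-8 s
```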
Disable speaker diarization:

```bash
# OpenAI API
-F "enable_speaker_diarization=false"

# Alibaba Cloud API
?enable_speaker_diarization=false
```

Automatic long audio segmentation:
- VAD Voice Detection - Detects voice boundaries and filters silence
- Greedy Merge - Accumulates voice segments, ensuring each segment does not exceed `MAX_SEGMENT_SEC` (default 90 s)
- Silence Split - Forces a split when the silence between voice segments exceeds 3 seconds
- Batch Inference - Multi-segment parallel processing, 2-3x performance improvement in GPU mode
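The segmentation steps above can be sketched like so. `MAX_SEGMENT_SEC` (90 s) and the 3-second silence threshold are documented; the greedy loop itself is an illustrative reconstruction:

```python
# Illustrative greedy merge over VAD voice intervals: accumulate intervals
# until MAX_SEGMENT_SEC would be exceeded, and force a split whenever the
# silence gap between intervals exceeds 3 seconds. Not the service's code.

def greedy_merge(vad_segments, max_segment_sec=90.0, silence_split_sec=3.0):
    """vad_segments: list of (start, end) voice intervals, in seconds."""
    chunks, current = [], []
    for start, end in vad_segments:
        if current:
            gap = start - current[-1][1]
            too_long = end - current[0][0] > max_segment_sec
            if gap > silence_split_sec or too_long:
                chunks.append((current[0][0], current[-1][1]))
                current = []
        current.append((start, end))
    if current:
        chunks.append((current[0][0], current[-1][1]))
    return chunks

print(greedy_merge([(0, 10), (11, 20), (25, 30)]))  # [(0, 20), (25, 30)]
```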
FunASR Model Limitations (using `/ws/v1/asr` or `/ws/v1/asr/funasr`):
- ✅ Real-time speech recognition, low latency
- ✅ Sentence-level timestamps
- ❌ Word-level timestamps (not implemented)
- ❌ Confidence scores (not implemented)

Qwen3-ASR Streaming (using `/ws/v1/asr/qwen`):
- ✅ Word-level timestamps
- ✅ Multi-language real-time recognition
| Model ID | Name | Description | Features |
|---|---|---|---|
| `qwen3-asr-1.7b` | Qwen3-ASR 1.7B | High-performance multilingual ASR, 52 languages + dialects, vLLM backend | Offline/Realtime |
| `qwen3-asr-0.6b` | Qwen3-ASR 0.6B | Lightweight multilingual ASR, suitable for low-VRAM environments | Offline/Realtime |
| `paraformer-large` | Paraformer Large | High-precision Chinese speech recognition | Offline/Realtime |
Model Selection:

Use the `ENABLED_MODELS` environment variable to control which models to load:

```bash
# Options: auto, all, or comma-separated list
ENABLED_MODELS=auto                              # Auto-detect GPU and load appropriate models
ENABLED_MODELS=all                               # Load all available models
ENABLED_MODELS=paraformer-large                  # Only Paraformer
ENABLED_MODELS=qwen3-asr-0.6b                    # Only Qwen3 0.6B
ENABLED_MODELS=paraformer-large,qwen3-asr-0.6b   # Both
```

Auto mode behavior:
- VRAM >= 32GB: Auto-load `qwen3-asr-1.7b` + `paraformer-large`
- VRAM < 32GB: Auto-load `qwen3-asr-0.6b` + `paraformer-large`
- No CUDA: Only `paraformer-large` (Qwen3 requires vLLM/GPU)
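The auto-mode rules can be expressed as a tiny helper. The thresholds and model IDs come from the list above; the function itself is just an illustration of the documented behavior, not the service's code:

```python
# Illustration of the documented ENABLED_MODELS=auto rules: model choice
# depends on CUDA availability and total VRAM. Not the service's actual code.

def auto_models(vram_gb: float, has_cuda: bool) -> list[str]:
    if not has_cuda:
        return ["paraformer-large"]  # Qwen3 requires vLLM/GPU
    if vram_gb >= 32:
        return ["qwen3-asr-1.7b", "paraformer-large"]
    return ["qwen3-asr-0.6b", "paraformer-large"]

print(auto_models(48, True))   # ['qwen3-asr-1.7b', 'paraformer-large']
print(auto_models(16, True))   # ['qwen3-asr-0.6b', 'paraformer-large']
print(auto_models(0, False))   # ['paraformer-large']
```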
| Variable | Default | Description |
|---|---|---|
| `ENABLED_MODELS` | `auto` | Models to load: auto, all, or comma-separated list |
| `API_KEY` | - | API authentication key (optional; unauthenticated if not set) |
| `LOG_LEVEL` | `INFO` | Log level (DEBUG/INFO/WARNING/ERROR) |
| `MAX_AUDIO_SIZE` | `2048` | Max audio file size (MB, supports units like 2GB) |
| `ASR_BATCH_SIZE` | `4` | ASR batch size (GPU: 4, CPU: 2) |
| `MAX_SEGMENT_SEC` | `90` | Max audio segment duration (seconds) |
| `ENABLE_STREAMING_VLLM` | `false` | Load a streaming vLLM instance (saves VRAM) |
| `MODELSCOPE_PATH` | `~/.cache/modelscope/hub/models` | ModelScope cache path |
| `HF_HOME` | `~/.cache/huggingface` | HuggingFace cache path (GPU mode) |
| `ASR_ENABLE_LM` | `true` | Enable language model (Paraformer) |
| `ASR_ENABLE_NEARFIELD_FILTER` | `true` | Enable far-field sound filtering |
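`MAX_AUDIO_SIZE` accepts either a bare megabyte count or a suffixed value like `2GB`. A hypothetical parser illustrating that convention (not the service's actual implementation):

```python
# Hypothetical parser for MAX_AUDIO_SIZE-style values: a bare number is
# interpreted as megabytes; KB/MB/GB suffixes are converted. Illustrative
# only -- not the service's actual code.

def parse_size_mb(value) -> float:
    s = str(value).strip().upper()
    for suffix, factor in (("GB", 1024.0), ("MB", 1.0), ("KB", 1 / 1024)):
        if s.endswith(suffix):
            return float(s[: -len(suffix)]) * factor
    return float(s)  # bare number -> MB

print(parse_size_mb("2GB"))    # 2048.0
print(parse_size_mb("2048"))   # 2048.0
print(parse_size_mb("500MB"))  # 500.0
```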
Detailed configuration: Near-Field Filter Docs
Minimum (CPU):
- CPU: 4 cores
- Memory: 16GB
- Disk: 20GB
Recommended (GPU):
- CPU: 4 cores
- Memory: 16GB
- GPU: NVIDIA GPU (16GB+ VRAM)
- Disk: 20GB
After starting the service:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Deployment Guide: Detailed Docs
- Near-Field Filter Config: Config Guide
- FunASR: FunASR GitHub
- Chinese README: 中文文档
This project uses the MIT License - see LICENSE file for details.
Issues and Pull Requests are welcome to improve the project!
