A speech recognition API service powered by FunASR and Qwen3-ASR, supporting 52 languages and compatible with the OpenAI API and the Alibaba Cloud Speech API.
- Web Demo: https://asr.vect.one
- Multi-Model Support - Integrates Qwen3-ASR 1.7B/0.6B and Paraformer Large ASR models
- Speaker Diarization - Automatic multi-speaker identification using CAM++ model
- OpenAI API Compatible - Supports the `/v1/audio/transcriptions` endpoint, works with the OpenAI SDK
- Alibaba Cloud API Compatible - Supports the Alibaba Cloud Speech RESTful API and WebSocket streaming protocol
- WebSocket Streaming - Real-time streaming speech recognition with low latency
- Smart Far-Field Filtering - Automatically filters far-field sounds and ambient noise in streaming ASR
- Intelligent Audio Segmentation - VAD-based greedy merge algorithm for automatic long audio splitting
- GPU Batch Processing - Batch inference support, 2-3x faster than sequential processing
- Flexible Configuration - Environment variable based configuration, load models on demand
```bash
# Copy and edit configuration
cp .env.example .env
# Edit .env to set ENABLED_MODELS and API_KEY (optional)

# Start service (GPU version)
docker-compose up -d

# Or CPU version
docker-compose -f docker-compose-cpu.yml up -d

# Multi-GPU auto mode (one instance per visible GPU)
CUDA_VISIBLE_DEVICES=0,1,2,3 docker-compose up -d
```

Service URLs:
- API Endpoint: http://localhost:17003
- API Docs: http://localhost:17003/docs
Optional built-in rate limit settings:
- `NGINX_RATE_LIMIT_RPS` - global requests/sec (`0` = disabled)
- `NGINX_RATE_LIMIT_BURST` - global burst (`0` = auto, uses the RPS value)
docker run (alternative):

```bash
# GPU version
docker run -d --name funasr-api \
  --gpus all \
  -p 17003:8000 \
  -e ENABLED_MODELS=auto \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -e API_KEY=your_api_key \
  -v ./models/modelscope:/root/.cache/modelscope \
  -v ./models/huggingface:/root/.cache/huggingface \
  quantatrisk/funasr-api:gpu-latest

# CPU version
docker run -d --name funasr-api \
  -p 17003:8000 \
  -e ENABLED_MODELS=paraformer-large \
  quantatrisk/funasr-api:cpu-latest
```

Note: the CPU environment automatically filters out the Qwen3 models (vLLM requires a GPU).
Offline Deployment: Use the helper script to prepare the models, then copy them to the offline machine:

```bash
# 1. Prepare models (interactive selection)
./scripts/prepare-models.sh

# 2. Copy the package to the offline server
scp funasr-models-*.tar.gz user@server:/opt/funasr-api/

# 3. On the offline server, extract and start
tar -xzvf funasr-models-*.tar.gz
docker-compose up -d
```

See MODEL_SETUP.md for more details.
Detailed deployment instructions: Deployment Guide
System Requirements:
- Python 3.10+
- CUDA 12.6+ (optional, for GPU acceleration)
- FFmpeg (audio format conversion)
Installation:

```bash
# Clone project
cd FunASR-API

# Install dependencies
pip install -r requirements.txt

# Start service
python start.py
```

| Endpoint | Method | Function |
|---|---|---|
| `/v1/audio/transcriptions` | POST | Audio transcription (OpenAI compatible) |
| `/v1/models` | GET | Model list |
Request Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `file` | file | Mutually exclusive with `audio_address` | Audio file |
| `audio_address` | string | Mutually exclusive with `file` | Audio file URL (HTTP/HTTPS) |
| `model` | string | auto-detect | Model selection (`qwen3-asr-1.7b`, `qwen3-asr-0.6b`, `paraformer-large`) |
| `language` | string | auto-detect | Language code (zh/en/ja) |
| `enable_speaker_diarization` | bool | `true` | Enable speaker diarization |
| `word_timestamps` | bool | `true` | Return word-level timestamps (Qwen3-ASR only) |
| `response_format` | string | `verbose_json` | Output format |
| `prompt` | string | - | Prompt text (reserved) |
| `temperature` | float | `0` | Sampling temperature (reserved) |
Audio Input Methods:
- File Upload: Use the `file` parameter to upload an audio file (standard OpenAI way)
- URL Download: Use the `audio_address` parameter to provide an audio URL; the service downloads it automatically
Usage Examples:

```python
# Using OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your_api_key")

with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # Maps to default model
        file=f,
        response_format="verbose_json"  # Get segments and speaker info
    )
print(transcript.text)
```

```bash
# Using curl
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
  -H "Authorization: Bearer your_api_key" \
  -F "file=@audio.wav" \
  -F "model=paraformer-large" \
  -F "response_format=verbose_json" \
  -F "enable_speaker_diarization=true"
```

Supported Response Formats: `json`, `text`, `srt`, `vtt`, `verbose_json`
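Beyond file upload, the `audio_address` parameter documented above lets the service fetch the audio itself. A minimal sketch, assuming the endpoint accepts these fields as ordinary form data; the URL and API key below are placeholders:

```python
# Sketch: transcribe a remote file via the audio_address form field instead of
# uploading. Field names come from the parameter table above; the example URL
# and API key are placeholders, not real values.

def build_form(audio_url, model="paraformer-large", diarize=True):
    """Assemble the form fields for /v1/audio/transcriptions."""
    return {
        "audio_address": audio_url,
        "model": model,
        "response_format": "verbose_json",
        "enable_speaker_diarization": "true" if diarize else "false",
    }

if __name__ == "__main__":
    import requests  # third-party: pip install requests

    resp = requests.post(
        "http://localhost:8000/v1/audio/transcriptions",
        headers={"Authorization": "Bearer your_api_key"},
        data=build_form("https://example.com/audio.wav"),
    )
    print(resp.json()["text"])
```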
| Endpoint | Method | Function |
|---|---|---|
| `/stream/v1/asr` | POST | Speech recognition (long audio support) |
| `/stream/v1/asr/models` | GET | Model list |
| `/stream/v1/asr/health` | GET | Health check |
| `/ws/v1/asr` | WebSocket | Streaming ASR (Alibaba Cloud protocol compatible) |
| `/ws/v1/asr/funasr` | WebSocket | FunASR streaming (backward compatible) |
| `/ws/v1/asr/qwen` | WebSocket | Qwen3-ASR streaming |
| `/ws/v1/asr/test` | GET | WebSocket test page |
Request Parameters:

| Parameter | Type | Default | Description |
|---|---|---|---|
| `model_id` | string | auto-detect | Model ID |
| `audio_address` | string | - | Audio URL (optional) |
| `sample_rate` | int | `16000` | Sample rate |
| `enable_speaker_diarization` | bool | `true` | Enable speaker diarization |
| `word_timestamps` | bool | `true` | Return word-level timestamps (Qwen3-ASR only) |
| `vocabulary_id` | string | - | Hotwords (format: `word1 weight1 word2 weight2`) |
Usage Examples:

```bash
# Basic usage
curl -X POST "http://localhost:8000/stream/v1/asr" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @audio.wav

# With parameters
curl -X POST "http://localhost:8000/stream/v1/asr?enable_speaker_diarization=true" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @audio.wav
```

Response Example:

```json
{
  "task_id": "xxx",
  "status": 200,
  "message": "SUCCESS",
  "result": "Speaker1 content...\nSpeaker2 content...",
  "duration": 60.5,
  "processing_time": 1.234,
  "segments": [
    {
      "text": "Today is a nice day.",
      "start_time": 0.0,
      "end_time": 2.5,
      "speaker_id": "Speaker1",
      "word_tokens": [
        {"text": "Today", "start_time": 0.0, "end_time": 0.5},
        {"text": "is", "start_time": 0.5, "end_time": 0.7},
        {"text": "a nice day", "start_time": 0.7, "end_time": 1.5}
      ]
    }
  ]
}
```

WebSocket Streaming Test: Visit http://localhost:8000/ws/v1/asr/test
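The `segments` array in the response can be post-processed client-side. A small illustrative helper, relying only on the documented `text` and `speaker_id` fields, that joins each speaker's text (the sample segments below are made up):

```python
# Group transcript text per speaker from the response's segments array.
# Uses only the documented "text" and "speaker_id" fields; the sample data
# is illustrative, not real service output.
from collections import defaultdict

def transcript_by_speaker(segments):
    """Return {speaker_id: joined text}, preserving segment order."""
    grouped = defaultdict(list)
    for seg in segments:
        grouped[seg.get("speaker_id", "unknown")].append(seg["text"])
    return {spk: " ".join(parts) for spk, parts in grouped.items()}

segments = [
    {"text": "Today is a nice day.", "speaker_id": "Speaker1"},
    {"text": "It certainly is.", "speaker_id": "Speaker2"},
    {"text": "Shall we go out?", "speaker_id": "Speaker1"},
]
print(transcript_by_speaker(segments))
# {'Speaker1': 'Today is a nice day. Shall we go out?', 'Speaker2': 'It certainly is.'}
```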
Multi-speaker automatic identification based on the CAM++ model:

- Enabled by Default - `enable_speaker_diarization=true`
- Automatic Detection - No preset speaker count needed; the model auto-detects
- Speaker Labels - Response includes a `speaker_id` field (e.g., "Speaker1", "Speaker2")
- Smart Merging - Two-layer merge strategy to avoid isolated short segments:
  - Layer 1: Accumulate-merge same-speaker segments shorter than 10 seconds
  - Layer 2: Accumulate-merge continuous segments up to 60 seconds
- Subtitle Support - SRT/VTT output includes speaker labels: `[Speaker1] text content`
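The two-layer strategy above can be sketched roughly as follows. The 10 s and 60 s thresholds are the documented values, but the merge loop itself is an illustrative reconstruction, not the service's actual implementation:

```python
# Illustrative sketch of the two-layer merge: absorb short same-speaker
# fragments (layer 1), then keep extending a same-speaker run up to the
# 60-second cap (layer 2). Not the service's actual code.

def merge_segments(segments, short_sec=10.0, max_sec=60.0):
    merged = []
    for seg in segments:
        if merged and merged[-1]["speaker_id"] == seg["speaker_id"]:
            prev = merged[-1]
            frag_len = seg["end_time"] - seg["start_time"]
            run_len = seg["end_time"] - prev["start_time"]
            # Layer 1: fragment under short_sec; Layer 2: run still under max_sec
            if frag_len < short_sec or run_len <= max_sec:
                prev["end_time"] = seg["end_time"]
                prev["text"] += " " + seg["text"]
                continue
        merged.append(dict(seg))
    return merged

segs = [
    {"speaker_id": "Speaker1", "start_time": 0.0, "end_time": 2.0, "text": "Hi."},
    {"speaker_id": "Speaker1", "start_time": 2.0, "end_time": 5.0, "text": "How are you?"},
    {"speaker_id": "Speaker2", "start_time": 5.0, "end_time": 8.0, "text": "Fine."},
]
print(merge_segments(segs))  # two segments: Speaker1 0-5 s, Speaker2 5-8 s
```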
Disable speaker diarization:

```bash
# OpenAI API
-F "enable_speaker_diarization=false"

# Alibaba Cloud API
?enable_speaker_diarization=false
```

Automatic long audio segmentation:
- VAD Voice Detection - Detects voice boundaries and filters silence
- Greedy Merge - Accumulates voice segments, ensuring each segment does not exceed `MAX_SEGMENT_SEC` (default 90 s)
- Silence Split - Forces a split when the silence between voice segments exceeds 3 seconds
- Batch Inference - Multi-segment parallel processing, 2-3x performance improvement in GPU mode
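The segmentation steps above can be sketched like so. `MAX_SEGMENT_SEC` (90 s) and the 3-second silence threshold are documented; the greedy loop itself is an illustrative reconstruction:

```python
# Illustrative greedy merge over VAD voice intervals: accumulate intervals
# until MAX_SEGMENT_SEC would be exceeded, and force a split whenever the
# silence gap between intervals exceeds 3 seconds. Not the service's code.

def greedy_merge(vad_segments, max_segment_sec=90.0, silence_split_sec=3.0):
    """vad_segments: list of (start, end) voice intervals, in seconds."""
    chunks, current = [], []
    for start, end in vad_segments:
        if current:
            gap = start - current[-1][1]
            too_long = end - current[0][0] > max_segment_sec
            if gap > silence_split_sec or too_long:
                chunks.append((current[0][0], current[-1][1]))
                current = []
        current.append((start, end))
    if current:
        chunks.append((current[0][0], current[-1][1]))
    return chunks

print(greedy_merge([(0, 10), (11, 20), (25, 30)]))  # [(0, 20), (25, 30)]
```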
FunASR Model Limitations (using `/ws/v1/asr` or `/ws/v1/asr/funasr`):
- ✅ Real-time speech recognition, low latency
- ✅ Sentence-level timestamps
- ❌ Word-level timestamps (not implemented)
- ❌ Confidence scores (not implemented)

Qwen3-ASR Streaming (using `/ws/v1/asr/qwen`):
- ✅ Word-level timestamps
- ✅ Multi-language real-time recognition
| Model ID | Name | Description | Features |
|---|---|---|---|
| `qwen3-asr-1.7b` | Qwen3-ASR 1.7B | High-performance multilingual ASR, 52 languages + dialects, vLLM backend | Offline/Realtime |
| `qwen3-asr-0.6b` | Qwen3-ASR 0.6B | Lightweight multilingual ASR, suitable for low-VRAM environments | Offline/Realtime |
| `paraformer-large` | Paraformer Large | High-precision Chinese speech recognition | Offline/Realtime |
Model Selection:

Use the `ENABLED_MODELS` environment variable to control which models to load:

```bash
# Options: auto, all, or comma-separated list
ENABLED_MODELS=auto                              # Auto-detect GPU and load appropriate models
ENABLED_MODELS=all                               # Load all available models
ENABLED_MODELS=paraformer-large                  # Only Paraformer
ENABLED_MODELS=qwen3-asr-0.6b                    # Only Qwen3 0.6B
ENABLED_MODELS=paraformer-large,qwen3-asr-0.6b   # Both
```

Auto mode behavior:
- VRAM >= 32GB: Auto-load `qwen3-asr-1.7b` + `paraformer-large`
- VRAM < 32GB: Auto-load `qwen3-asr-0.6b` + `paraformer-large`
- No CUDA: Only `paraformer-large` (Qwen3 requires vLLM/GPU)
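The auto-mode rules can be expressed as a tiny helper. The thresholds and model IDs come from the list above; the function itself is just an illustration of the documented behavior, not the service's code:

```python
# Illustration of the documented ENABLED_MODELS=auto rules: model choice
# depends on CUDA availability and total VRAM. Not the service's actual code.

def auto_models(vram_gb: float, has_cuda: bool) -> list[str]:
    if not has_cuda:
        return ["paraformer-large"]  # Qwen3 requires vLLM/GPU
    if vram_gb >= 32:
        return ["qwen3-asr-1.7b", "paraformer-large"]
    return ["qwen3-asr-0.6b", "paraformer-large"]

print(auto_models(48, True))   # ['qwen3-asr-1.7b', 'paraformer-large']
print(auto_models(16, True))   # ['qwen3-asr-0.6b', 'paraformer-large']
print(auto_models(0, False))   # ['paraformer-large']
```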
| Variable | Default | Description |
|---|---|---|
| `ENABLED_MODELS` | `auto` | Models to load: auto, all, or comma-separated list |
| `API_KEY` | - | API authentication key (optional; unauthenticated if not set) |
| `LOG_LEVEL` | `INFO` | Log level (DEBUG/INFO/WARNING/ERROR) |
| `MAX_AUDIO_SIZE` | `2048` | Max audio file size (MB, supports units like 2GB) |
| `ASR_BATCH_SIZE` | `4` | ASR batch size (GPU: 4, CPU: 2) |
| `MAX_SEGMENT_SEC` | `90` | Max audio segment duration (seconds) |
| `ENABLE_STREAMING_VLLM` | `false` | Load a streaming vLLM instance (saves VRAM) |
| `MODELSCOPE_PATH` | `~/.cache/modelscope/hub/models` | ModelScope cache path |
| `HF_HOME` | `~/.cache/huggingface` | HuggingFace cache path (GPU mode) |
| `ASR_ENABLE_LM` | `true` | Enable language model (Paraformer) |
| `ASR_ENABLE_NEARFIELD_FILTER` | `true` | Enable far-field sound filtering |
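`MAX_AUDIO_SIZE` accepts either a bare megabyte count or a suffixed value like `2GB`. A hypothetical parser illustrating that convention (not the service's actual implementation):

```python
# Hypothetical parser for MAX_AUDIO_SIZE-style values: a bare number is
# interpreted as megabytes; KB/MB/GB suffixes are converted. Illustrative
# only -- not the service's actual code.

def parse_size_mb(value) -> float:
    s = str(value).strip().upper()
    for suffix, factor in (("GB", 1024.0), ("MB", 1.0), ("KB", 1 / 1024)):
        if s.endswith(suffix):
            return float(s[: -len(suffix)]) * factor
    return float(s)  # bare number -> MB

print(parse_size_mb("2GB"))    # 2048.0
print(parse_size_mb("2048"))   # 2048.0
print(parse_size_mb("500MB"))  # 500.0
```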
Detailed configuration: Near-Field Filter Docs
Minimum (CPU):
- CPU: 4 cores
- Memory: 16GB
- Disk: 20GB
Recommended (GPU):
- CPU: 4 cores
- Memory: 16GB
- GPU: NVIDIA GPU (16GB+ VRAM)
- Disk: 20GB
After starting the service:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
- Deployment Guide: Detailed Docs
- Near-Field Filter Config: Config Guide
- FunASR: FunASR GitHub
- Chinese README: 中文文档
This project uses the MIT License - see LICENSE file for details.
Issues and Pull Requests are welcome to improve the project!
