Ready-to-use Local Speech Recognition API Service

Speech recognition API service powered by FunASR and Qwen3-ASR, supporting 52 languages, compatible with OpenAI API and Alibaba Cloud Speech API.

Live Demo Site

Features

  • Multi-Model Support - Integrates Qwen3-ASR 1.7B/0.6B and Paraformer Large ASR models
  • Speaker Diarization - Automatic multi-speaker identification using CAM++ model
  • OpenAI API Compatible - Supports /v1/audio/transcriptions endpoint, works with OpenAI SDK
  • Alibaba Cloud API Compatible - Supports Alibaba Cloud Speech RESTful API and WebSocket streaming protocol
  • WebSocket Streaming - Real-time streaming speech recognition with low latency
  • Smart Far-Field Filtering - Automatically filters far-field sounds and ambient noise in streaming ASR
  • Intelligent Audio Segmentation - VAD-based greedy merge algorithm for automatic long audio splitting
  • GPU Batch Processing - Batch inference support, 2-3x faster than sequential processing
  • Flexible Configuration - Environment-variable-based configuration; models load on demand

Quick Deployment

1. Docker Deployment (Recommended)

# Copy and edit configuration
cp .env.example .env
# Edit .env to set ENABLED_MODELS and API_KEY (optional)

# Start service (GPU version)
docker-compose up -d

# Or CPU version
docker-compose -f docker-compose-cpu.yml up -d

# Multi-GPU auto mode (one instance per visible GPU)
CUDA_VISIBLE_DEVICES=0,1,2,3 docker-compose up -d

Service URLs:

  • API Endpoint: http://localhost:17003
  • API Docs: http://localhost:17003/docs

Optional built-in rate limit settings:

  • NGINX_RATE_LIMIT_RPS (global requests/sec; 0 = disabled)
  • NGINX_RATE_LIMIT_BURST (global burst; 0 = defaults to the RPS value)
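
For example, a minimal .env enabling authentication and a global limit of 20 requests/sec might look like this (values illustrative):

# .env (illustrative values)
ENABLED_MODELS=auto
API_KEY=your_api_key
NGINX_RATE_LIMIT_RPS=20
NGINX_RATE_LIMIT_BURST=40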

docker run (alternative):

# GPU version
docker run -d --name funasr-api \
  --gpus all \
  -p 17003:8000 \
  -e ENABLED_MODELS=auto \
  -e CUDA_VISIBLE_DEVICES=0,1,2,3 \
  -e API_KEY=your_api_key \
  -v ./models/modelscope:/root/.cache/modelscope \
  -v ./models/huggingface:/root/.cache/huggingface \
  quantatrisk/funasr-api:gpu-latest

# CPU version
docker run -d --name funasr-api \
  -p 17003:8000 \
  -e ENABLED_MODELS=paraformer-large \
  quantatrisk/funasr-api:cpu-latest

Note: in a CPU-only environment, Qwen3 models are filtered out automatically (vLLM requires a GPU)

Offline Deployment: Use the helper script to prepare models, then copy to the offline machine:

# 1. Prepare models (interactive selection)
./scripts/prepare-models.sh

# 2. Copy the package to offline server
scp funasr-models-*.tar.gz user@server:/opt/funasr-api/

# 3. On offline server, extract and start
tar -xzvf funasr-models-*.tar.gz
docker-compose up -d

See MODEL_SETUP.md for more details.

Detailed deployment instructions: Deployment Guide

Local Development

System Requirements:

  • Python 3.10+
  • CUDA 12.6+ (optional, for GPU acceleration)
  • FFmpeg (audio format conversion)

Installation:

# Clone the project
git clone https://github.com/congjinruo/funasr-api.git
cd funasr-api

# Install dependencies
pip install -r requirements.txt

# Start service
python start.py

API Endpoints

OpenAI Compatible API

| Endpoint | Method | Function |
| --- | --- | --- |
| /v1/audio/transcriptions | POST | Audio transcription (OpenAI compatible) |
| /v1/models | GET | Model list |

Request Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| file | file | - | Audio file upload (mutually exclusive with audio_address) |
| audio_address | string | - | Audio file URL, HTTP/HTTPS (mutually exclusive with file) |
| model | string | auto-detect | Model selection (qwen3-asr-1.7b, qwen3-asr-0.6b, paraformer-large) |
| language | string | auto-detect | Language code (zh/en/ja) |
| enable_speaker_diarization | bool | true | Enable speaker diarization |
| word_timestamps | bool | true | Return word-level timestamps (Qwen3-ASR only) |
| response_format | string | verbose_json | Output format |
| prompt | string | - | Prompt text (reserved) |
| temperature | float | 0 | Sampling temperature (reserved) |

Audio Input Methods:

  • File Upload: use the file parameter to upload an audio file (standard OpenAI way)
  • URL Download: use the audio_address parameter to provide an audio URL; the service downloads it automatically (see the sketch under Usage Examples below)

Usage Examples:

# Using OpenAI SDK
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="your_api_key")

with open("audio.wav", "rb") as f:
    transcript = client.audio.transcriptions.create(
        model="whisper-1",  # Maps to default model
        file=f,
        response_format="verbose_json"  # Get segments and speaker info
    )
print(transcript.text)

# Using curl
curl -X POST "http://localhost:8000/v1/audio/transcriptions" \
  -H "Authorization: Bearer your_api_key" \
  -F "file=@audio.wav" \
  -F "model=paraformer-large" \
  -F "response_format=verbose_json" \
  -F "enable_speaker_diarization=true"

Supported Response Formats: json, text, srt, vtt, verbose_json

Alibaba Cloud Compatible API

| Endpoint | Method | Function |
| --- | --- | --- |
| /stream/v1/asr | POST | Speech recognition (long audio support) |
| /stream/v1/asr/models | GET | Model list |
| /stream/v1/asr/health | GET | Health check |
| /ws/v1/asr | WebSocket | Streaming ASR (Alibaba Cloud protocol compatible) |
| /ws/v1/asr/funasr | WebSocket | FunASR streaming (backward compatible) |
| /ws/v1/asr/qwen | WebSocket | Qwen3-ASR streaming |
| /ws/v1/asr/test | GET | WebSocket test page |

Request Parameters:

| Parameter | Type | Default | Description |
| --- | --- | --- | --- |
| model_id | string | auto-detect | Model ID |
| audio_address | string | - | Audio URL (optional) |
| sample_rate | int | 16000 | Sample rate (Hz) |
| enable_speaker_diarization | bool | true | Enable speaker diarization |
| word_timestamps | bool | true | Return word-level timestamps (Qwen3-ASR only) |
| vocabulary_id | string | - | Hotwords (format: word1 weight1 word2 weight2) |

Usage Examples:

# Basic usage
curl -X POST "http://localhost:8000/stream/v1/asr" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @audio.wav

# With parameters
curl -X POST "http://localhost:8000/stream/v1/asr?enable_speaker_diarization=true" \
  -H "Content-Type: application/octet-stream" \
  --data-binary @audio.wav
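
The same call from Python, a minimal sketch using requests (endpoint and parameters as documented above):

import requests

with open("audio.wav", "rb") as f:
    resp = requests.post(
        "http://localhost:8000/stream/v1/asr",
        params={"enable_speaker_diarization": "true"},
        headers={"Content-Type": "application/octet-stream"},
        data=f,  # raw audio bytes as the request body
    )
print(resp.json()["result"])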

Response Example:

{
  "task_id": "xxx",
  "status": 200,
  "message": "SUCCESS",
  "result": "Speaker1 content...\nSpeaker2 content...",
  "duration": 60.5,
  "processing_time": 1.234,
  "segments": [
    {
      "text": "Today is a nice day.",
      "start_time": 0.0,
      "end_time": 2.5,
      "speaker_id": "Speaker1",
      "word_tokens": [
        {"text": "Today", "start_time": 0.0, "end_time": 0.5},
        {"text": "is", "start_time": 0.5, "end_time": 0.7},
        {"text": "a nice day", "start_time": 0.7, "end_time": 1.5}
      ]
    }
  ]
}
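
The segments array can be consumed directly; a short sketch continuing from the Python example above:

result = resp.json()
for seg in result["segments"]:
    print(f"[{seg['speaker_id']}] {seg['start_time']:.1f}-{seg['end_time']:.1f}s: {seg['text']}")
    # word_tokens are present when a Qwen3-ASR model handled the request
    for tok in seg.get("word_tokens", []):
        print(f"  {tok['start_time']:.2f}s  {tok['text']}")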

WebSocket Streaming Test: Visit http://localhost:8000/ws/v1/asr/test

Speaker Diarization

Automatic multi-speaker identification based on the CAM++ model:

  • Enabled by Default - enable_speaker_diarization=true
  • Automatic Detection - No preset speaker count needed, model auto-detects
  • Speaker Labels - Response includes speaker_id field (e.g., "Speaker1", "Speaker2")
  • Smart Merging - Two-layer merge strategy to avoid isolated short segments (sketched after this list):
    • Layer 1: Accumulate merge same-speaker segments < 10 seconds
    • Layer 2: Accumulate merge continuous segments up to 60 seconds
  • Subtitle Support - SRT/VTT output includes speaker labels [Speaker1] text content
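
The two merge layers can be pictured roughly like this (an illustrative sketch of the documented strategy, not the project's actual code; segments are dicts shaped like the response example above):

def merge_segments(segments, short_sec=10, max_total_sec=60):
    # Layer 1: absorb segments shorter than 10s into the preceding same-speaker segment
    layer1 = []
    for seg in segments:
        prev = layer1[-1] if layer1 else None
        short = (seg["end_time"] - seg["start_time"]) < short_sec
        if prev and prev["speaker_id"] == seg["speaker_id"] and short:
            prev["end_time"] = seg["end_time"]
            prev["text"] += " " + seg["text"]
        else:
            layer1.append(dict(seg))
    # Layer 2: accumulate consecutive same-speaker segments up to 60s total
    merged = []
    for seg in layer1:
        prev = merged[-1] if merged else None
        if prev and prev["speaker_id"] == seg["speaker_id"] and \
                (seg["end_time"] - prev["start_time"]) <= max_total_sec:
            prev["end_time"] = seg["end_time"]
            prev["text"] += " " + seg["text"]
        else:
            merged.append(dict(seg))
    return merged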

Disable speaker diarization:

# OpenAI API
-F "enable_speaker_diarization=false"

# Alibaba Cloud API
?enable_speaker_diarization=false

Audio Processing

Intelligent Segmentation Strategy

Long audio is split automatically (a sketch follows the list):

  1. VAD Voice Detection - Detect voice boundaries, filter silence
  2. Greedy Merge - Accumulate voice segments, ensure each segment does not exceed MAX_SEGMENT_SEC (default 90s)
  3. Silence Split - Force split when silence between voice segments exceeds 3 seconds
  4. Batch Inference - Multi-segment parallel processing, 2-3x performance improvement in GPU mode
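
A rough sketch of the greedy merge over VAD output (illustrative only; MAX_SEGMENT_SEC and the 3-second silence rule come from the list above):

def plan_chunks(vad_segments, max_segment_sec=90, max_silence_sec=3):
    # vad_segments: list of (start, end) voice intervals in seconds, from VAD
    chunks, cur_start, cur_end = [], None, None
    for start, end in vad_segments:
        if cur_start is None:
            cur_start, cur_end = start, end
            continue
        gap = start - cur_end
        # Force a split on long silence, or when the chunk would exceed the cap
        if gap > max_silence_sec or (end - cur_start) > max_segment_sec:
            chunks.append((cur_start, cur_end))
            cur_start = start
        cur_end = end
    if cur_start is not None:
        chunks.append((cur_start, cur_end))
    return chunks

# e.g. plan_chunks([(0, 40), (41, 80), (95, 120)]) -> [(0, 80), (95, 120)]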

WebSocket Streaming Limitations

FunASR Model Limitations (using /ws/v1/asr or /ws/v1/asr/funasr):

  • ✅ Real-time speech recognition, low latency
  • ✅ Sentence-level timestamps
  • ❌ Word-level timestamps (not implemented)
  • ❌ Confidence scores (not implemented)

Qwen3-ASR Streaming (using /ws/v1/asr/qwen):

  • ✅ Word-level timestamps
  • ✅ Multi-language real-time recognition

Supported Models

| Model ID | Name | Description | Modes |
| --- | --- | --- | --- |
| qwen3-asr-1.7b | Qwen3-ASR 1.7B | High-performance multilingual ASR, 52 languages + dialects, vLLM backend | Offline/Realtime |
| qwen3-asr-0.6b | Qwen3-ASR 0.6B | Lightweight multilingual ASR for low-VRAM environments | Offline/Realtime |
| paraformer-large | Paraformer Large | High-accuracy Chinese speech recognition | Offline/Realtime |

Model Selection:

Use ENABLED_MODELS environment variable to control which models to load:

# Options: auto, all, or comma-separated list
ENABLED_MODELS=auto                    # Auto-detect GPU and load appropriate models
ENABLED_MODELS=all                     # Load all available models
ENABLED_MODELS=paraformer-large        # Only Paraformer
ENABLED_MODELS=qwen3-asr-0.6b          # Only Qwen3 0.6B
ENABLED_MODELS=paraformer-large,qwen3-asr-0.6b  # Both

Auto mode behavior:

  • VRAM >= 32GB: Auto-load qwen3-asr-1.7b + paraformer-large
  • VRAM < 32GB: Auto-load qwen3-asr-0.6b + paraformer-large
  • No CUDA: Only paraformer-large (Qwen3 requires vLLM/GPU)
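
The auto-mode decision mirrors the rules above; a minimal sketch (illustrative, not the project's code; assumes torch is importable for the VRAM probe):

import torch

def pick_models():
    if not torch.cuda.is_available():
        return ["paraformer-large"]  # Qwen3 models need vLLM, which needs a GPU
    vram_gb = torch.cuda.get_device_properties(0).total_memory / 1024**3
    if vram_gb >= 32:
        return ["qwen3-asr-1.7b", "paraformer-large"]
    return ["qwen3-asr-0.6b", "paraformer-large"]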

Environment Variables

| Variable | Default | Description |
| --- | --- | --- |
| ENABLED_MODELS | auto | Models to load: auto, all, or a comma-separated list |
| API_KEY | - | API authentication key (optional; requests are unauthenticated if unset) |
| LOG_LEVEL | INFO | Log level (DEBUG/INFO/WARNING/ERROR) |
| MAX_AUDIO_SIZE | 2048 | Max audio file size in MB (values with units such as 2GB also accepted) |
| ASR_BATCH_SIZE | 4 | ASR batch size (GPU: 4, CPU: 2) |
| MAX_SEGMENT_SEC | 90 | Max audio segment duration (seconds) |
| ENABLE_STREAMING_VLLM | false | Load a dedicated streaming vLLM instance (leaving it off saves VRAM) |
| MODELSCOPE_PATH | ~/.cache/modelscope/hub/models | ModelScope cache path |
| HF_HOME | ~/.cache/huggingface | HuggingFace cache path (GPU mode) |
| ASR_ENABLE_LM | true | Enable language model (Paraformer) |
| ASR_ENABLE_NEARFIELD_FILTER | true | Enable far-field sound filtering |

Detailed configuration: Near-Field Filter Docs

Resource Requirements

Minimum (CPU):

  • CPU: 4 cores
  • Memory: 16GB
  • Disk: 20GB

Recommended (GPU):

  • CPU: 4 cores
  • Memory: 16GB
  • GPU: NVIDIA GPU (16GB+ VRAM)
  • Disk: 20GB

API Documentation

After starting the service:

  • Swagger UI: http://localhost:8000/docs
  • ReDoc: http://localhost:8000/redoc

License

This project uses the MIT License - see LICENSE file for details.

Contributing

Issues and pull requests to improve the project are welcome!
