Open-source speech understanding toolkit
Production-ready ASR, VAD, punctuation, speaker diarization, emotion detection, and audio event recognition with one unified Python interface.
from funasr import AutoModel
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    spk_model="cam++",
)
res = model.generate(input="meeting.wav")
print(res[0]["sentence_info"])
The public docs now track the latest README and main-branch capabilities.
2-3x faster LLM decoding for Fun-ASR-Nano, with tensor parallel batch inference and real-time WebSocket service.
funasr-server exposes OpenAI-compatible transcription APIs; MCP and voice-input examples connect local ASR to AI tools.
Long-form benchmark results cover SenseVoice, Paraformer, Fun-ASR-Nano, GLM-ASR, and Whisper variants on GPU and CPU.
Run an OpenAI-compatible transcription endpoint locally, then plug it into agents, apps, and batch pipelines without sending audio to a cloud ASR provider.
Use SenseVoice for a fast first test, or switch models after the endpoint is running.
pip install funasr fastapi uvicorn python-multipart
funasr-server --model sensevoice --device cuda
Download a public sample and call the same route used by OpenAI-compatible clients.
curl -L https://isv-data.oss-cn-hangzhou.aliyuncs.com/ics/MaaS/ASR/test_audio/BAC009S0764W0121.wav -o sample.wav
curl http://localhost:8000/v1/audio/transcriptions \
  -F file=@sample.wav \
  -F model=sensevoice \
  -F response_format=verbose_json
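The same request can be sent from Python. The sketch below uses only the standard library and mirrors the curl call above: same route, same `file`, `model`, and `response_format` form fields. The helper names `build_multipart` and `transcribe` are illustrative and not part of FunASR, and the default base URL assumes the server is listening on `localhost:8000` as in the curl example.

```python
import json
import os
import urllib.request
import uuid


def build_multipart(fields, file_field, filename, file_bytes):
    """Encode text fields plus one file as multipart/form-data.

    Returns (body_bytes, content_type_header). Hypothetical helper,
    written for this example only.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append(
            (
                f"--{boundary}\r\n"
                f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
                f"{value}\r\n"
            ).encode("utf-8")
        )
    # The file part carries raw bytes, so it is assembled separately.
    parts.append(
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{file_field}"; '
            f'filename="{filename}"\r\n'
            "Content-Type: application/octet-stream\r\n\r\n"
        ).encode("utf-8")
        + file_bytes
        + b"\r\n"
    )
    parts.append(f"--{boundary}--\r\n".encode("utf-8"))
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transcribe(path, base_url="http://localhost:8000", model="sensevoice"):
    """POST an audio file to /v1/audio/transcriptions and return parsed JSON."""
    with open(path, "rb") as f:
        audio = f.read()
    body, content_type = build_multipart(
        {"model": model, "response_format": "verbose_json"},
        "file",
        os.path.basename(path),
        audio,
    )
    request = urllib.request.Request(
        f"{base_url}/v1/audio/transcriptions",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, `transcribe("sample.wav")` should return the decoded JSON response. If the server follows the OpenAI response shape, the transcript is expected under the `text` key, but check the actual payload before relying on specific fields.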
Start with examples, tune on your own data, extend the registry, or jump into source-linked API docs.
Pick SenseVoice, Paraformer, Fun-ASR-Nano, streaming runtime, or OpenAI API aliases for your workload.
Deploy: Choose between Python API, OpenAI API, Docker Compose, Kubernetes, WebSocket runtime, vLLM, MCP, batch, subtitles, and Triton.
Compare: Evaluate FunASR against Whisper or cloud ASR with feature mapping, representative benchmarks, and rollout checks.
Choose: Find the fastest path for private APIs, agents, streaming, vLLM, subtitles, batch jobs, and benchmarks.
Learn: Install FunASR, choose a model, and run common ASR, VAD, diarization, and export flows.
Tune: Prepare JSONL data, fine-tune Paraformer, SenseVoice, and Fun-ASR-Nano, then monitor runs.
Extend: Understand the registry, add a model, package remote code, and avoid integration pitfalls.
Accelerate: Run LLM-based ASR with vLLM, tensor parallel batch decoding, streaming SDK, and WebSocket service.
Integrate: Expose FunASR as an OpenAI-compatible endpoint, low-code workflow node, MCP tool, voice input, or subtitle generator.
Measure: Compare FunASR and Whisper on long-form audio, including GPU and CPU speed/CER numbers.
Reference: Browse generated classes, methods, source previews, and GitHub line links.
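For readers unfamiliar with the CER metric mentioned in the benchmark guide: character error rate is the character-level edit distance between a reference transcript and the model output, divided by the reference length. The sketch below is a generic implementation, not the scoring script used for the published numbers. Stripping whitespace before scoring is an assumption of this example; real benchmarks usually also normalize punctuation and casing.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(
                min(
                    prev[j] + 1,         # deletion from the reference
                    curr[j - 1] + 1,     # insertion into the hypothesis
                    prev[j - 1] + cost,  # substitution or match
                )
            )
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance / number of reference characters."""
    # Assumption for this sketch: whitespace is ignored on both sides.
    ref = [c for c in reference if not c.isspace()]
    hyp = [c for c in hypothesis if not c.isspace()]
    if not ref:
        raise ValueError("reference must contain at least one character")
    return edit_distance(ref, hyp) / len(ref)
```

Because insertions count as errors, CER can exceed 1.0 when the hypothesis is much longer than the reference.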
Everything needed for speech understanding, from raw audio segmentation to speaker-aware transcripts.
Streaming and offline ASR with VAD segmentation. Process long-form audio with a single API call.
Fun-ASR-Nano covers 31 languages and Qwen3-ASR covers 52 languages with language detection.
Identify who spoke when, then attach speaker labels to sentence-level ASR output.
SenseVoice detects emotion and audio events including background music, applause, laughter, and crying.
Non-autoregressive models support fast batch and realtime workloads across common deployment targets.
Fine-tune with DeepSpeed, export to ONNX, and deploy through Docker runtime or the Python SDK.
Pre-trained industrial models ready for recognition, segmentation, and speech understanding workflows.
End-to-end ASR trained on tens of millions of hours. 31 languages, dialects, accents, lyrics, timestamps, and hotwords.
Non-autoregressive Chinese and English ASR with streaming and offline variants for production systems.
Multi-task speech understanding for ASR, language ID, emotion, and audio events across five languages.
LLM-based ASR with 52 languages, contextual understanding, and automatic language detection.
Install locally, or run the Colab quickstart first to transcribe a sample in your browser.
pip install funasr
# Or latest:
pip install git+https://github.com/modelscope/FunASR.git
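from funasr import AutoModel

# ASR model combined with VAD, punctuation, and speaker models.
model = AutoModel(
    model="paraformer-zh",
    vad_model="fsmn-vad",
    punc_model="ct-punc",
    spk_model="cam++",
)
# batch_size_s sets the dynamic batch size in seconds of audio.
res = model.generate(input="meeting.wav", batch_size_s=300)
# Print each sentence with its speaker label.
for sent in res[0]["sentence_info"]:
    print(f"[Speaker {sent['spk']}] {sent['text']}")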
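Sentence-level output often contains several consecutive sentences from the same speaker. A minimal sketch for collapsing them into speaker turns follows. It relies on the `spk` and `text` keys shown above; the `start` and `end` keys (assumed to be milliseconds) are an assumption about the `sentence_info` entries and are read defensively with `.get`. The function names are illustrative, not part of FunASR.

```python
def merge_speaker_turns(sentence_info, sep=""):
    """Collapse consecutive sentences from the same speaker into one turn.

    sep joins sentence texts; "" suits Chinese, " " suits English.
    The input list is not modified.
    """
    turns = []
    for sent in sentence_info:
        if turns and turns[-1]["spk"] == sent["spk"]:
            # Same speaker continues: extend the current turn.
            turns[-1]["text"] += sep + sent["text"]
            turns[-1]["end"] = sent.get("end", turns[-1]["end"])
        else:
            # Speaker changed (or first sentence): open a new turn.
            turns.append(
                {
                    "spk": sent["spk"],
                    "text": sent["text"],
                    "start": sent.get("start"),  # assumed field, may be None
                    "end": sent.get("end"),      # assumed field, may be None
                }
            )
    return turns


def format_ms(ms):
    """Render milliseconds as MM:SS.mmm, or '??:??' when the time is unknown."""
    if ms is None:
        return "??:??"
    minutes, rest = divmod(int(ms), 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{minutes:02d}:{seconds:02d}.{millis:03d}"
```

Applied to the result above, `merge_speaker_turns(res[0]["sentence_info"])` yields one entry per uninterrupted speaker turn, and each turn can be printed with `format_ms(turn["start"])` as a prefix.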
Related projects around ASR, speech understanding, video clipping, and voice generation.
The latest ASR large model with multilingual recognition, timestamps, speaker diarization, and hotwords.
Multi-task speech understanding for ASR, emotion detection, and audio event recognition.
AI video clipping powered by FunASR and LLM-assisted editing workflows.
Natural speech generation with multi-language, timbre, and emotion control.