---
title: Streaming Zipformer
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
short_description: Streaming zipformer
---
# Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)
This project demonstrates a real-time speech-to-text (ASR) web application with:
- Sherpa-ONNX streaming Zipformer model
- FastAPI backend with WebSocket support
- Configurable browser-based UI using vanilla HTML/JS
- Docker-compatible deployment (CPU-only) on Hugging Face Spaces
## Model
The app uses the bilingual (Chinese-English) streaming Zipformer model:
Model source: Zipformer Small Bilingual zh-en (2023-02-16)
Model files (ONNX) are located under `models/zipformer_bilingual/`.
## Features
- **Real-Time Microphone Input**: capture audio directly in the browser.
- **Recognition Settings**: select the ASR model and precision; view supported languages and model size.
- **Hotword Biasing**: input custom hotwords (one per line) and adjust the boost score. See the Sherpa-ONNX Hotwords Guide.
- **Endpoint Detection**: configure silence-based rules (Rule 1 threshold, Rule 2 threshold, minimum utterance length) to control segmentation. See Sherpa-NCNN Endpoint Detection.
- **Volume Meter**: real-time volume indicator based on RMS (see the sketch after this list).
- **Streaming Transcription**: display partial (in red) and final (in green) results with automatic scrolling.
- **Debug Logging**: backend logs configuration steps and endpoint detection events.
- **Deployment**: Dockerfile provided for CPU-only deployment on Hugging Face Spaces.
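
The volume meter is a plain RMS computation over each audio chunk. The UI does this client-side in JavaScript; the Python sketch below is only an equivalent illustration of the formula:

```python
import numpy as np

def rms_level(pcm: bytes) -> float:
    """RMS of a chunk of 16-bit mono PCM, normalized to [0, 1]."""
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    if samples.size == 0:
        return 0.0
    return float(np.sqrt(np.mean(samples ** 2)))
```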
## Configuration Guide
### Hotword Biasing Configuration
- **Hotwords List** (`hotwordsList`): Enter one hotword or phrase per line. These are words/phrases the ASR will preferentially recognize. For multilingual models, you can mix scripts according to your model's modeling unit (e.g., `cjkchar+bpe`).
- **Boost Score** (`boostScore`): A global score applied at the token level for each matched hotword (range: `0.0` to `10.0`). You may also specify per-hotword scores inline in the list using the `word :score` syntax, for example:

  ```
  语音识别 :3.5
  深度学习 :2.0
  SPEECH RECOGNITION :1.5
  ```
- **Decoding Method**: Ensure your model uses `modified_beam_search` (not the default `greedy_search`) to enable hotword biasing.
- **Applying**: Click **Apply Hotwords** in the UI to send the following JSON payload to the backend:

  ```json
  { "type": "config", "hotwords": ["..."], "hotwordsScore": 2.0 }
  ```
(For full details, see the Sherpa-ONNX Hotwords Guide (k2-fsa.github.io).)
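
For context, here is a minimal sketch of how hotword biasing plugs into sherpa-onnx on the backend. The ONNX file names under `models/zipformer_bilingual/` are placeholders; check `app/asr_worker.py` for the actual setup:

```python
import sherpa_onnx

# Minimal sketch: build a streaming recognizer with hotword biasing enabled.
# File names below are hypothetical; see app/asr_worker.py for the real ones.
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="models/zipformer_bilingual/tokens.txt",
    encoder="models/zipformer_bilingual/encoder.onnx",
    decoder="models/zipformer_bilingual/decoder.onnx",
    joiner="models/zipformer_bilingual/joiner.onnx",
    # Hotword biasing requires modified_beam_search; it is a no-op under
    # the default greedy_search.
    decoding_method="modified_beam_search",
    hotwords_file="hotwords.txt",  # one bare word or "word :score" per line
    hotwords_score=2.0,            # global boost for entries without a score
)
```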
### Endpoint Detection Configuration
The system supports three endpointing rules borrowed from Kaldi:
- **Rule 1** (`epRule1`): Minimum duration of trailing silence, in seconds, that triggers an endpoint (default: `2.4`). Fires whether or not any token has been decoded.
- **Rule 2** (`epRule2`): Minimum duration of trailing silence, in seconds, that triggers an endpoint only after at least one token has been decoded (default: `1.2`).
- **Rule 3** (`epRule3`): Maximum utterance length, in seconds, before an endpoint is forced (default: `300`). Disable it by setting a very large value.
- **Applying**: Click **Apply Endpoint Config** in the UI to send the following JSON payload to the backend:

  ```json
  { "type": "config", "epRule1": 2.4, "epRule2": 1.2, "epRule3": 300 }
  ```
(See the [Sherpa-NCNN Endpointing documentation](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).)
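
Assuming a recognizer built as in the hotword sketch above, with `enable_endpoint_detection=True` and the rule values from the payload (`rule1_min_trailing_silence=2.4`, `rule2_min_trailing_silence=1.2`, `rule3_min_utterance_length=300`), the per-chunk decode loop in `app/asr_worker.py` could look roughly like this:

```python
import numpy as np

# stream = recognizer.create_stream() is created once per WebSocket connection.
def feed_chunk(recognizer, stream, pcm: bytes, sample_rate: int = 16000):
    """Feed one chunk of 16-bit mono PCM; return (text, is_final)."""
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    stream.accept_waveform(sample_rate, samples)
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    text = recognizer.get_result(stream)
    if recognizer.is_endpoint(stream):  # one of the three rules fired
        recognizer.reset(stream)        # start a fresh utterance
        return text, True               # final (green) result
    return text, False                  # partial (red) result
```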
## Local Development
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the app locally:

  ```bash
  uvicorn app.main:app --reload --host 0.0.0.0 --port 8501
  ```

Open http://localhost:8501 in your browser.
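
To smoke-test the backend without the browser UI, a small Python client can send the same config payload followed by raw audio. The `/ws` route and the raw 16-bit PCM framing are assumptions here; check `app/main.py` and `app/static/index.html` for the actual protocol:

```python
import asyncio
import json

import websockets  # pip install websockets

async def main():
    # NOTE: the "/ws" path is an assumption; see app/main.py for the real route.
    async with websockets.connect("ws://localhost:8501/ws") as ws:
        # Same payload the UI sends when you click "Apply Hotwords".
        await ws.send(json.dumps({
            "type": "config",
            "hotwords": ["语音识别", "SPEECH RECOGNITION"],
            "hotwordsScore": 2.0,
        }))
        # One second of 16 kHz, 16-bit mono silence as a placeholder chunk.
        await ws.send(b"\x00\x00" * 16000)
        print(await ws.recv())  # server reply, e.g. a partial transcript

asyncio.run(main())
```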
## Project Structure
```
.
├── app
│   ├── main.py            # FastAPI + WebSocket endpoint, config parsing, debug logging
│   ├── asr_worker.py      # Audio resampling, inference, endpoint detection, OpenCC conversion
│   └── static/index.html  # Client-side UI: recognition, hotword, endpoint, mic, transcript
├── models/zipformer_bilingual/
│   └── ... (onnx, tokens.txt)
├── requirements.txt
├── Dockerfile
└── README.md
```