---
title: Streaming Zipformer
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
short_description: Streaming zipformer
---
# Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)
This project demonstrates a real-time speech-to-text (ASR) web application with:
- Sherpa-ONNX streaming Zipformer model
- FastAPI backend with WebSocket support
- Configurable browser-based UI using vanilla HTML/JS
- Docker-compatible deployment (CPU-only) on Hugging Face Spaces
## Model
The app uses the bilingual (Chinese-English) streaming Zipformer model:
Model source: Zipformer Small Bilingual zh-en (2023-02-16)
Model files (ONNX) are located under `models/zipformer_bilingual/`.
## Features
- **Real-Time Microphone Input**: capture audio directly in the browser.
- **Recognition Settings**: select the ASR model and precision; view supported languages and model size.
- **Hotword Biasing**: input custom hotwords (one per line) and adjust the boost score. See the Sherpa-ONNX Hotwords Guide.
- **Endpoint Detection**: configure silence-based rules (Rule 1 threshold, Rule 2 threshold, minimum utterance length) to control segmentation. See Sherpa-NCNN Endpoint Detection.
- **Volume Meter**: real-time volume indicator based on RMS (see the sketch after this list).
- **Streaming Transcription**: display partial (in red) and final (in green) results with automatic scrolling.
- **Debug Logging**: backend logs configuration steps and endpoint detection events.
- **Deployment**: Dockerfile provided for CPU-only deployment on Hugging Face Spaces.
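
The volume meter is a plain RMS computation over each audio chunk. The UI does this client-side in JavaScript; the Python sketch below is only an equivalent illustration of the formula:

```python
import numpy as np

def rms_level(pcm: bytes) -> float:
    """RMS of a chunk of 16-bit mono PCM, normalized to [0, 1]."""
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    if samples.size == 0:
        return 0.0
    return float(np.sqrt(np.mean(samples ** 2)))
```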
## Configuration Guide
### Hotword Biasing Configuration
- **Hotwords List** (`hotwordsList`): Enter one hotword or phrase per line. These are words/phrases the ASR will preferentially recognize. For multilingual models, you can mix scripts according to your model's modeling unit (e.g., `cjkchar+bpe`).
- **Boost Score** (`boostScore`): A global score applied at the token level for each matched hotword (range: `0.0` to `10.0`). You may also specify per-hotword scores inline in the list using the `word :score` syntax, for example:

  ```
  语音识别 :3.5
  深度学习 :2.0
  SPEECH RECOGNITION :1.5
  ```
- **Decoding Method**: Ensure your model uses `modified_beam_search` (not the default `greedy_search`) to enable hotword biasing.
- **Applying**: Click **Apply Hotwords** in the UI to send the following JSON payload to the backend:

  ```json
  { "type": "config", "hotwords": ["..."], "hotwordsScore": 2.0 }
  ```
(For full details, see the Sherpa-ONNX Hotwords Guide (k2-fsa.github.io).)
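
For context, here is a minimal sketch of how hotword biasing plugs into sherpa-onnx on the backend. The ONNX file names under `models/zipformer_bilingual/` are placeholders; check `app/asr_worker.py` for the actual setup:

```python
import sherpa_onnx

# Minimal sketch: build a streaming recognizer with hotword biasing enabled.
# File names below are hypothetical; see app/asr_worker.py for the real ones.
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="models/zipformer_bilingual/tokens.txt",
    encoder="models/zipformer_bilingual/encoder.onnx",
    decoder="models/zipformer_bilingual/decoder.onnx",
    joiner="models/zipformer_bilingual/joiner.onnx",
    # Hotword biasing requires modified_beam_search; it is a no-op under
    # the default greedy_search.
    decoding_method="modified_beam_search",
    hotwords_file="hotwords.txt",  # one bare word or "word :score" per line
    hotwords_score=2.0,            # global boost for entries without a score
)
```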
### Endpoint Detection Configuration
The system supports three endpointing rules borrowed from Kaldi:
- **Rule 1** (`epRule1`): Minimum duration of trailing silence, in seconds, that triggers an endpoint (default: `2.4`). Fires whether or not any token has been decoded.
- **Rule 2** (`epRule2`): Minimum duration of trailing silence, in seconds, that triggers an endpoint only after at least one token has been decoded (default: `1.2`).
- **Rule 3** (`epRule3`): Maximum utterance length, in seconds, before an endpoint is forced (default: `300`). Disable it by setting a very large value.
- **Applying**: Click **Apply Endpoint Config** in the UI to send the following JSON payload to the backend:

  ```json
  { "type": "config", "epRule1": 2.4, "epRule2": 1.2, "epRule3": 300 }
  ```
(See the [Sherpa-NCNN Endpointing documentation](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).)
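
Assuming a recognizer built as in the hotword sketch above, with `enable_endpoint_detection=True` and the rule values from the payload (`rule1_min_trailing_silence=2.4`, `rule2_min_trailing_silence=1.2`, `rule3_min_utterance_length=300`), the per-chunk decode loop in `app/asr_worker.py` could look roughly like this:

```python
import numpy as np

# stream = recognizer.create_stream() is created once per WebSocket connection.
def feed_chunk(recognizer, stream, pcm: bytes, sample_rate: int = 16000):
    """Feed one chunk of 16-bit mono PCM; return (text, is_final)."""
    samples = np.frombuffer(pcm, dtype=np.int16).astype(np.float32) / 32768.0
    stream.accept_waveform(sample_rate, samples)
    while recognizer.is_ready(stream):
        recognizer.decode_stream(stream)
    text = recognizer.get_result(stream)
    if recognizer.is_endpoint(stream):  # one of the three rules fired
        recognizer.reset(stream)        # start a fresh utterance
        return text, True               # final (green) result
    return text, False                  # partial (red) result
```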
## Local Development
- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```

- Run the app locally:

  ```bash
  uvicorn app.main:app --reload --host 0.0.0.0 --port 8501
  ```

Open http://localhost:8501 in your browser.
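
To smoke-test the backend without the browser UI, a small Python client can send the same config payload followed by raw audio. The `/ws` route and the raw 16-bit PCM framing are assumptions here; check `app/main.py` and `app/static/index.html` for the actual protocol:

```python
import asyncio
import json

import websockets  # pip install websockets

async def main():
    # NOTE: the "/ws" path is an assumption; see app/main.py for the real route.
    async with websockets.connect("ws://localhost:8501/ws") as ws:
        # Same payload the UI sends when you click "Apply Hotwords".
        await ws.send(json.dumps({
            "type": "config",
            "hotwords": ["语音识别", "SPEECH RECOGNITION"],
            "hotwordsScore": 2.0,
        }))
        # One second of 16 kHz, 16-bit mono silence as a placeholder chunk.
        await ws.send(b"\x00\x00" * 16000)
        print(await ws.recv())  # server reply, e.g. a partial transcript

asyncio.run(main())
```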
## Project Structure
```
.
├── app
│   ├── main.py            # FastAPI + WebSocket endpoint, config parsing, debug logging
│   ├── asr_worker.py      # Audio resampling, inference, endpoint detection, OpenCC conversion
│   └── static/index.html  # Client-side UI: recognition, hotword, endpoint, mic, transcript
├── models/zipformer_bilingual/
│   └── ... (onnx, tokens.txt)
├── requirements.txt
├── Dockerfile
└── README.md
```