---
title: Streaming Zipformer
emoji: 🎙️
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
short_description: Streaming zipformer
---
# Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)
This project demonstrates a real-time speech-to-text (ASR) web application with:
* [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx) streaming Zipformer model
* FastAPI backend with WebSocket support
* Configurable browser-based UI using vanilla HTML/JS
* Docker-compatible deployment (CPU-only) on Hugging Face Spaces
## Model
The app uses the bilingual (Chinese-English) streaming Zipformer model:
**Model Source:** [Zipformer Small Bilingual zh-en (2023-02-16)](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16-bilingual-chinese-english)
Model files (ONNX) are located under:
```
models/zipformer_bilingual/
```
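For reference, a minimal sketch of how such a model can be loaded with the sherpa-onnx Python API. The exact file names are assumptions based on the standard streaming-zipformer transducer layout; adjust them to the files actually shipped in the directory:
```python
import sherpa_onnx

MODEL_DIR = "models/zipformer_bilingual"  # assumed layout: encoder/decoder/joiner + tokens.txt

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens=f"{MODEL_DIR}/tokens.txt",
    encoder=f"{MODEL_DIR}/encoder.onnx",
    decoder=f"{MODEL_DIR}/decoder.onnx",
    joiner=f"{MODEL_DIR}/joiner.onnx",
    num_threads=1,                    # CPU-only deployment
    sample_rate=16000,                # expected input rate after resampling
    feature_dim=80,
    decoding_method="greedy_search",  # use modified_beam_search for hotword biasing
)
```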
## Features
* **Real-Time Microphone Input:** capture audio directly in the browser.
* **Recognition Settings:** select the ASR model and precision; view supported languages and model size.
* **Hotword Biasing:** enter custom hotwords (one per line) and adjust the boost score. See the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).
* **Endpoint Detection:** configure silence-based rules (Rule 1 threshold, Rule 2 threshold, minimum utterance length) to control segmentation. See [Sherpa-NCNN Endpoint Detection](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).
* **Volume Meter:** real-time volume indicator based on RMS (see the sketch after this list).
* **Streaming Transcription:** display partial (red) and final (green) results with automatic scrolling.
* **Debug Logging:** the backend logs configuration steps and endpoint-detection events.
* **Deployment:** Dockerfile provided for CPU-only deployment on Hugging Face Spaces.
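The volume meter is driven by a root-mean-square (RMS) computation over each captured audio chunk. The app computes this client-side in JavaScript; the idea, sketched in Python for reference:
```python
import numpy as np

def rms_level(chunk: np.ndarray) -> float:
    """RMS level of a float32 audio chunk with samples in [-1, 1]."""
    return float(np.sqrt(np.mean(np.square(chunk))))
```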
## Configuration Guide
### Hotword Biasing Configuration
* **Hotwords List** (`hotwordsList`): Enter one hotword or phrase per line. These are words or phrases the ASR will preferentially recognize. For multilingual models, you can mix scripts according to your model's `modeling-unit` (e.g., `cjkchar+bpe`).
* **Boost Score** (`boostScore`): A global score applied at the token level for each matched hotword (range: `0.0`–`10.0`). You may also specify per-hotword scores inline in the list using `:`, for example:
```
语音识别 :3.5
深度学习 :2.0
SPEECH RECOGNITION :1.5
```
* **Decoding Method**: Hotword biasing only takes effect with `modified_beam_search`; the default `greedy_search` ignores hotwords, so make sure the recognizer is configured accordingly.
* **Applying**: Click **Apply Hotwords** in the UI to send the following JSON payload to the backend:
```json
{
"type": "config",
"hotwords": ["..."],
"hotwordsScore": 2.0
}
```
(For full details, see the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).)
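On the backend, hotword biasing corresponds to the `hotwords_file` and `hotwords_score` arguments of the sherpa-onnx recognizer, combined with `modified_beam_search`. A hedged sketch of how the config message might be applied; the helper and the file handling are illustrative, not the app's actual code:
```python
import tempfile
import sherpa_onnx

def build_biased_recognizer(hotwords: list[str], boost_score: float):
    """Illustrative: write hotwords to a file and rebuild the recognizer."""
    hw = tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False)
    hw.write("\n".join(hotwords))
    hw.close()
    return sherpa_onnx.OnlineRecognizer.from_transducer(
        tokens="models/zipformer_bilingual/tokens.txt",    # assumed paths
        encoder="models/zipformer_bilingual/encoder.onnx",
        decoder="models/zipformer_bilingual/decoder.onnx",
        joiner="models/zipformer_bilingual/joiner.onnx",
        decoding_method="modified_beam_search",  # required for hotword biasing
        hotwords_file=hw.name,
        hotwords_score=boost_score,
    )
```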
### Endpoint Detection Configuration
The system supports three endpointing rules borrowed from Kaldi:
* **Rule 1** (`epRule1`): Minimum duration of trailing silence to trigger an endpoint, in **seconds** (default: `2.4`). Fires whether or not any token has been decoded.
* **Rule 2** (`epRule2`): Minimum duration of trailing silence to trigger an endpoint *only after* at least one token has been decoded, in **seconds** (default: `1.2`).
* **Rule 3** (`epRule3`): Maximum utterance length before forcing an endpoint, in **milliseconds** (default: `300`). Disable it by setting a very large value.
* **Applying**: Click **Apply Endpoint Config** in the UI to send the following JSON payload to the backend:
```json
{
"type": "config",
"epRule1": 2.4,
"epRule2": 1.2,
"epRule3": 300
}
```
(See the [Sherpa-NCNN Endpointing documentation](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).)
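These rules map onto the endpointing arguments of the sherpa-onnx recognizer. A minimal sketch, with parameter names taken from the sherpa-onnx Python API (note that sherpa-onnx expresses `rule3_min_utterance_length` in seconds, so a millisecond `epRule3` value would need converting):
```python
import sherpa_onnx

recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens="models/zipformer_bilingual/tokens.txt",    # assumed paths
    encoder="models/zipformer_bilingual/encoder.onnx",
    decoder="models/zipformer_bilingual/decoder.onnx",
    joiner="models/zipformer_bilingual/joiner.onnx",
    enable_endpoint_detection=True,
    rule1_min_trailing_silence=2.4,  # trailing silence, regardless of decoded tokens (s)
    rule2_min_trailing_silence=1.2,  # trailing silence after at least one token (s)
    rule3_min_utterance_length=20,   # force an endpoint after this utterance length (s)
)
```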
## Local Development
1. **Install dependencies**
```bash
pip install -r requirements.txt
```
2. **Run the app locally**
```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8501
```
Open [http://localhost:8501](http://localhost:8501) in your browser.
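For orientation, a minimal sketch of what the WebSocket endpoint in `app/main.py` might look like. The message handling and the helpers (`apply_config`, `transcribe_chunk`) are hypothetical, not the app's actual code:
```python
import json
from fastapi import FastAPI, WebSocket

app = FastAPI()

@app.websocket("/ws")
async def ws_endpoint(ws: WebSocket):
    await ws.accept()
    while True:
        msg = await ws.receive()
        if msg["type"] == "websocket.disconnect":
            break
        if "text" in msg:
            # JSON control messages, e.g. the hotword/endpoint config payloads above.
            cfg = json.loads(msg["text"])
            if cfg.get("type") == "config":
                apply_config(cfg)  # hypothetical helper
        elif "bytes" in msg:
            # Raw PCM audio chunks from the browser microphone.
            text, is_final = transcribe_chunk(msg["bytes"])  # hypothetical helper
            await ws.send_json({"text": text, "final": is_final})
```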
## Project Structure
```
.
├── app
│   ├── main.py             # FastAPI + WebSocket endpoint, config parsing, debug logging
│   ├── asr_worker.py       # Audio resampling, inference, endpoint detection, OpenCC conversion
│   └── static/index.html   # Client-side UI: recognition, hotword, endpoint, mic, transcript
├── models/zipformer_bilingual/
│   └── ... (onnx, tokens.txt)
├── requirements.txt
├── Dockerfile
└── README.md
```
## Credits
* [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx)
* [OpenCC](https://github.com/BYVoid/OpenCC)
* [FastAPI](https://fastapi.tiangolo.com/)
* [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces)