---
title: Streaming Zipformer
emoji: πŸ‘€
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
short_description: Streaming zipformer
---

# πŸŽ™οΈ Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)

This project demonstrates a real-time speech-to-text (ASR) web application built with:

* 🧠 [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx) streaming Zipformer model
* πŸš€ FastAPI backend with WebSocket support
* πŸŽ›οΈ Configurable browser-based UI using vanilla HTML/JS
* ☁️ Docker-compatible deployment (CPU-only) on Hugging Face Spaces

## πŸ“¦ Model

The app uses the bilingual (Chinese-English) streaming Zipformer model:

πŸ”— **Model Source:** [Zipformer Small Bilingual zh-en (2023-02-16)](https://k2-fsa.github.io/sherpa/onnx/pretrained_models/online-transducer/zipformer-transducer-models.html#sherpa-onnx-streaming-zipformer-small-bilingual-zh-en-2023-02-16-bilingual-chinese-english)

Model files (ONNX) are located under:

```
models/zipformer_bilingual/
```
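
For reference, here is a minimal sketch of how the backend might construct a streaming recognizer from these files. The file names and parameters below are assumptions based on the standard sherpa-onnx release layout; adjust them to match the actual contents of the directory.

```python
import sherpa_onnx

MODEL_DIR = "models/zipformer_bilingual"

# Hypothetical file names following the usual sherpa-onnx naming scheme.
recognizer = sherpa_onnx.OnlineRecognizer.from_transducer(
    tokens=f"{MODEL_DIR}/tokens.txt",
    encoder=f"{MODEL_DIR}/encoder-epoch-99-avg-1.onnx",
    decoder=f"{MODEL_DIR}/decoder-epoch-99-avg-1.onnx",
    joiner=f"{MODEL_DIR}/joiner-epoch-99-avg-1.onnx",
    num_threads=2,
    sample_rate=16000,
    feature_dim=80,
    decoding_method="modified_beam_search",  # needed later for hotword biasing
    enable_endpoint_detection=True,
)
```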

## πŸš€ Features

* 🎀 **Real-Time Microphone Input:** capture audio directly in the browser.
* πŸŽ›οΈ **Recognition Settings:** select ASR model and precision; view supported languages and model size.
* πŸ”‘ **Hotword Biasing:** input custom hotwords (one per line) and adjust boost score. See [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).
* ⏱️ **Endpoint Detection:** configure silence-based rules (Rule 1 threshold, Rule 2 threshold, minimum utterance length) to control segmentation. See [Sherpa-NCNN Endpoint Detection](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).
* πŸ“Š **Volume Meter:** real-time volume indicator based on RMS.
* πŸ’¬ **Streaming Transcription:** display partial (in red) and final (in green) results with automatic scrolling.
* πŸ› οΈ **Debug Logging:** backend logs configuration steps and endpoint detection events.
* 🐳 **Deployment:** Dockerfile provided for CPU-only deployment on Hugging Face Spaces.

## πŸ› οΈ Configuration Guide

### πŸ”‘ Hotword Biasing Configuration

* **Hotwords List** (`hotwordsList`): Enter one hotword or phrase per line. These are words/phrases the ASR will preferentially recognize. For multilingual models, you can mix scripts according to your model’s `modeling-unit` (e.g., `cjkchar+bpe`).
* **Boost Score** (`boostScore`): A global score applied at the token level for each matched hotword (range: `0.0`–`10.0`). You may also specify per-hotword scores inline in the list using `:`, for example:

  ```
  θ―­ιŸ³θ―†εˆ« :3.5
  ζ·±εΊ¦ε­¦δΉ  :2.0
  SPEECH RECOGNITION :1.5
  ```
* **Decoding Method**: Ensure your model uses `modified_beam_search` (not the default `greedy_search`) to enable hotword biasing.
* **Applying**: Click **Apply Hotwords** in the UI to send the following JSON payload to the backend:

  ```json
  {
    "type": "config",
    "hotwords": ["..."],
    "hotwordsScore": 2.0
  }
  ```

(For full details, see the [Sherpa-ONNX Hotwords Guide](https://k2-fsa.github.io/sherpa/onnx/hotwords/index.html).)
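
On the backend, a hotword configuration like the one above can be applied when a new stream is created. The sketch below is illustrative rather than this app's actual code: `build_hotwords` is a hypothetical helper, and `recognizer` is assumed to be a sherpa-onnx `OnlineRecognizer` created with `decoding_method="modified_beam_search"`.

```python
def build_hotwords(lines: list[str]) -> str:
    """Join UI hotword lines into the '/'-separated string sherpa-onnx expects.

    Per-phrase boosts keep their ':<score>' suffix, matching the syntax above.
    """
    return "/".join(line.strip() for line in lines if line.strip())


hotwords = build_hotwords(["θ―­ιŸ³θ―†εˆ« :3.5", "ζ·±εΊ¦ε­¦δΉ  :2.0", "SPEECH RECOGNITION :1.5"])

# The global boost (hotwords_score) is fixed when the recognizer is built;
# per-stream hotwords are passed to create_stream().
stream = recognizer.create_stream(hotwords=hotwords)
```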

### ⏱️ Endpoint Detection Configuration

The system supports three endpointing rules borrowed from Kaldi:

* **Rule 1** (`epRule1`): Minimum duration of trailing silence to trigger an endpoint, in **seconds** (default: `2.4`). Fires whether or not any token has been decoded.
* **Rule 2** (`epRule2`): Minimum duration of trailing silence to trigger an endpoint *only after* at least one token has been decoded, in **seconds** (default: `1.2`).
* **Rule 3** (`epRule3`): Maximum utterance length before forcing an endpoint, in **seconds** (default: `300`). Disable by setting a very large value.
* **Applying**: Click **Apply Endpoint Config** in the UI to send the following JSON payload to the backend:

  ```json
  {
    "type": "config",
    "epRule1": 2.4,
    "epRule2": 1.2,
    "epRule3": 300
  }
  ```

(See the [Sherpa-NCNN Endpointing documentation](https://k2-fsa.github.io/sherpa/ncnn/endpoint.html).)
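
Because sherpa-onnx fixes the endpoint rules at recognizer construction time, applying a new endpoint config generally means rebuilding the recognizer. Below is a hedged sketch of how the payload above might map onto sherpa-onnx's options; `make_recognizer` is a hypothetical helper, and the model paths mirror the earlier sketch.

```python
def make_recognizer(cfg: dict) -> sherpa_onnx.OnlineRecognizer:
    """Rebuild the recognizer from a {'type': 'config', ...} payload."""
    return sherpa_onnx.OnlineRecognizer.from_transducer(
        tokens=f"{MODEL_DIR}/tokens.txt",
        encoder=f"{MODEL_DIR}/encoder-epoch-99-avg-1.onnx",
        decoder=f"{MODEL_DIR}/decoder-epoch-99-avg-1.onnx",
        joiner=f"{MODEL_DIR}/joiner-epoch-99-avg-1.onnx",
        decoding_method="modified_beam_search",
        enable_endpoint_detection=True,
        # Field names come from the documented payload above.
        rule1_min_trailing_silence=cfg.get("epRule1", 2.4),
        rule2_min_trailing_silence=cfg.get("epRule2", 1.2),
        rule3_min_utterance_length=cfg.get("epRule3", 300),
    )
```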

## πŸ§ͺ Local Development

1. **Install dependencies**

```bash
pip install -r requirements.txt
```

2. **Run the app locally**

```bash
uvicorn app.main:app --reload --host 0.0.0.0 --port 8501
```

Open [http://localhost:8501](http://localhost:8501) in your browser.
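
To exercise the WebSocket API without a browser, a minimal test client might look like the sketch below. The `/ws` path and the raw 16 kHz float32 PCM framing are assumptions; check `app/main.py` for the actual contract.

```python
import asyncio
import json

import numpy as np
import websockets  # pip install websockets


async def main() -> None:
    async with websockets.connect("ws://localhost:8501/ws") as ws:  # assumed path
        # Send a config message matching the payloads documented above.
        await ws.send(json.dumps({"type": "config", "epRule1": 2.4,
                                  "epRule2": 1.2, "epRule3": 300}))
        # Stream one second of silence as 16 kHz float32 PCM (assumed format).
        await ws.send(np.zeros(16000, dtype=np.float32).tobytes())
        print(await ws.recv())  # partial/final transcript message


asyncio.run(main())
```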

## πŸ“ Project Structure

```
.
β”œβ”€β”€ app
β”‚   β”œβ”€β”€ main.py               # FastAPI + WebSocket endpoint, config parsing, debug logging
β”‚   β”œβ”€β”€ asr_worker.py         # Audio resampling, inference, endpoint detection, OpenCC conversion
β”‚   └── static/index.html     # Client-side UI: recognition, hotword, endpoint, mic, transcript
β”œβ”€β”€ models/zipformer_bilingual/
β”‚   └── ... (onnx, tokens.txt)
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ Dockerfile
└── README.md
```
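
As a side note, the OpenCC conversion mentioned for `asr_worker.py` typically looks like the snippet below; the `t2s` (Traditional-to-Simplified) direction is an assumption about this app's setup.

```python
from opencc import OpenCC  # e.g. pip install opencc-python-reimplemented

cc = OpenCC("t2s")  # assumed direction: Traditional -> Simplified
print(cc.convert("θͺžιŸ³θ­˜εˆ₯"))  # -> θ―­ιŸ³θ―†εˆ«
```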

## πŸ”§ Credits

* [Sherpa-ONNX](https://github.com/k2-fsa/sherpa-onnx)
* [OpenCC](https://github.com/BYVoid/OpenCC)
* [FastAPI](https://fastapi.tiangolo.com/)
* [Hugging Face Spaces](https://huggingface.co/docs/hub/spaces)