metadata
title: Streaming Zipformer
emoji: 👀
colorFrom: blue
colorTo: purple
sdk: docker
pinned: false
license: mit
short_description: Streaming zipformer

πŸŽ™οΈ Real-Time Streaming ASR Demo (FastAPI + Sherpa-ONNX)

This project demonstrates a real-time speech-to-text (ASR) web application with:

  • 🧠 Sherpa-ONNX streaming Zipformer model
  • 🚀 FastAPI backend with WebSocket support
  • 🎛️ Configurable browser-based UI using vanilla HTML/JS
  • ☁️ Docker-compatible deployment (CPU-only) on Hugging Face Spaces

📦 Model

The app uses the bilingual (Chinese-English) streaming Zipformer model:

🔗 Model Source: Zipformer Small Bilingual zh-en (2023-02-16)

Model files (ONNX) are located under:

models/zipformer_bilingual/

🚀 Features

  • 🎤 Real-Time Microphone Input: capture audio directly in the browser.
  • 🎛️ Recognition Settings: select ASR model and precision; view supported languages and model size.
  • 🔑 Hotword Biasing: input custom hotwords (one per line) and adjust boost score. See Sherpa-ONNX Hotwords Guide.
  • ⏱️ Endpoint Detection: configure the endpointing rules (Rule 1 silence threshold, Rule 2 silence threshold, Rule 3 utterance length) to control segmentation. See Sherpa-NCNN Endpoint Detection.
  • 📊 Volume Meter: real-time volume indicator based on RMS.
  • 💬 Streaming Transcription: display partial (in red) and final (in green) results with automatic scrolling.
  • 🛠️ Debug Logging: backend logs configuration steps and endpoint detection events.
  • 🐳 Deployment: Dockerfile provided for CPU-only deployment on Hugging Face Spaces.
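
The RMS-based volume indicator listed above can be illustrated with a short Python sketch. This is not taken from the app's code: the names `rms_level` and `rms_to_percent`, the assumption that samples are floats in [-1.0, 1.0], and the -60 dB floor are all hypothetical, and the app may well compute the level differently (for example, in the browser).

```python
import math
from typing import Sequence


def rms_level(samples: Sequence[float]) -> float:
    """Return the root-mean-square amplitude of one audio chunk.

    Samples are assumed to be floats in [-1.0, 1.0]; an empty chunk yields 0.0.
    """
    if not samples:
        return 0.0
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def rms_to_percent(rms: float, floor_db: float = -60.0) -> float:
    """Map an RMS value onto a 0-100 meter scale via decibels.

    floor_db is a hypothetical cutoff: anything at or below it shows as 0,
    and full scale (RMS of 1.0, i.e. 0 dB) shows as 100.
    """
    if rms <= 0.0:
        return 0.0
    db = 20.0 * math.log10(rms)
    return max(0.0, min(100.0, (db - floor_db) / -floor_db * 100.0))
```

A decibel scale is used for the meter because perceived loudness is roughly logarithmic; a linear mapping of RMS would leave the bar near zero for normal speech.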

πŸ› οΈ Configuration Guide

πŸ”‘ Hotword Biasing Configuration

  • Hotwords List (hotwordsList): Enter one hotword or phrase per line. These are the words and phrases the ASR will preferentially recognize. For multilingual models, you can mix scripts as long as they match your model's modeling unit (e.g., cjkchar+bpe).

  • Boost Score (boostScore): A global score applied at the token level for each matched hotword (range: 0.0–10.0). You may also specify a per-hotword score inline by appending a space and :score to the entry, for example:

    语音识别 :3.5
    深度学习 :2.0
    SPEECH RECOGNITION :1.5
    
  • Decoding Method: Hotword biasing only takes effect when the recognizer decodes with modified_beam_search, not the default greedy_search.

  • Applying: Click Apply Hotwords in the UI to send the following JSON payload to the backend:

    {
      "type": "config",
      "hotwords": ["..."],
      "hotwordsScore": 2.0
    }
    

For full details, see the Sherpa-ONNX Hotwords Guide (k2-fsa.github.io).
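
As a rough illustration of the hotword settings above, the following Python sketch turns the text of a hotword list into the config payload. The helper names `parse_hotword_line` and `build_hotword_config` are hypothetical, and two behaviours are assumptions rather than documented facts: that inline scores stay embedded in each `hotwords` string, and that the global score is clamped to the stated 0.0–10.0 range.

```python
from typing import Dict, List, Optional, Tuple


def parse_hotword_line(line: str) -> Optional[Tuple[str, Optional[float]]]:
    """Split 'PHRASE :SCORE' into (phrase, score); score is None when absent.

    Blank lines return None so callers can skip them.
    """
    text = line.strip()
    if not text:
        return None
    phrase, sep, tail = text.rpartition(" :")
    if sep:
        try:
            return phrase.strip(), float(tail)
        except ValueError:
            pass  # suffix is not numeric, so treat the whole line as the phrase
    return text, None


def build_hotword_config(raw_text: str, boost_score: float) -> Dict[str, object]:
    """Build the {"type": "config", ...} payload from the list text."""
    entries: List[str] = []
    for line in raw_text.splitlines():
        parsed = parse_hotword_line(line)
        if parsed is None:
            continue
        phrase, score = parsed
        # Assumption: per-hotword scores travel inside the hotword string.
        entries.append(phrase if score is None else f"{phrase} :{score}")
    # Assumption: keep the global score inside the documented 0.0-10.0 range.
    clamped = max(0.0, min(10.0, boost_score))
    return {"type": "config", "hotwords": entries, "hotwordsScore": clamped}
```

For example, `build_hotword_config("语音识别 :3.5\nSPEECH RECOGNITION", 2.0)` yields a payload with two hotword entries and `hotwordsScore` set to 2.0.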

⏱️ Endpoint Detection Configuration

The system supports three endpointing rules borrowed from Kaldi:

  • Rule 1 (epRule1): Minimum duration of trailing silence to trigger an endpoint, in seconds (default: 2.4). Fires whether or not any token has been decoded.

  • Rule 2 (epRule2): Minimum duration of trailing silence to trigger an endpoint only after at least one token is decoded, in seconds (default: 1.2).

  • Rule 3 (epRule3): Maximum utterance length before forcing an endpoint, in milliseconds (default: 300). Disable by setting a very large value.

  • Applying: Click Apply Endpoint Config in the UI to send the following JSON payload to the backend:

    {
      "type": "config",
      "epRule1": 2.4,
      "epRule2": 1.2,
      "epRule3": 300
    }
    

See the Sherpa-NCNN Endpointing documentation (k2-fsa.github.io).
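
The three rules above can be restated as a small decision function. This Python sketch only mirrors the description in this README; it is not Sherpa-ONNX's implementation. The `EndpointConfig` class, the `is_endpoint` function, and the choice of `>=` comparisons are assumptions, and Rule 3 is compared in milliseconds as documented here.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class EndpointConfig:
    ep_rule1: float = 2.4  # trailing silence (s), applies with or without decoded tokens
    ep_rule2: float = 1.2  # trailing silence (s), applies once a token is decoded
    ep_rule3: float = 300  # utterance length limit (ms, as stated in this README)

    @classmethod
    def from_payload(cls, payload: Mapping[str, Any]) -> "EndpointConfig":
        """Read epRule1/epRule2/epRule3 from a config message, keeping defaults."""
        base = cls()
        return cls(
            ep_rule1=float(payload.get("epRule1", base.ep_rule1)),
            ep_rule2=float(payload.get("epRule2", base.ep_rule2)),
            ep_rule3=float(payload.get("epRule3", base.ep_rule3)),
        )


def is_endpoint(
    cfg: EndpointConfig,
    trailing_silence_s: float,
    utterance_ms: float,
    has_decoded_token: bool,
) -> bool:
    """Return True if any of the three rules fires for the current state."""
    if trailing_silence_s >= cfg.ep_rule1:  # Rule 1: long silence, tokens optional
        return True
    if has_decoded_token and trailing_silence_s >= cfg.ep_rule2:  # Rule 2
        return True
    return utterance_ms >= cfg.ep_rule3  # Rule 3: utterance too long
```

With the defaults, 1.5 s of trailing silence ends the segment only if something has already been decoded, because Rule 2 applies and Rule 1 does not.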

🧪 Local Development

  1. Install dependencies

    pip install -r requirements.txt

  2. Run the app locally

    uvicorn app.main:app --reload --host 0.0.0.0 --port 8501

Open http://localhost:8501 in your browser.

Endpoint detection reference: https://k2-fsa.github.io/sherpa/ncnn/endpoint.html

πŸ“ Project Structure

.
β”œβ”€β”€ app
β”‚   β”œβ”€β”€ main.py               # FastAPI + WebSocket endpoint, config parsing, debug logging
β”‚   β”œβ”€β”€ asr_worker.py         # Audio resampling, inference, endpoint detection, OpenCC conversion
β”‚   └── static/index.html     # Client-side UI: recognition, hotword, endpoint, mic, transcript
β”œβ”€β”€ models/zipformer_bilingual/
β”‚   └── ... (onnx, tokens.txt)
β”œβ”€β”€ requirements.txt
β”œβ”€β”€ Dockerfile
└── README.md

🔧 Credits