Luigi committed on
Commit 920ed31 · 1 Parent(s): b429d7c

initial commit

Files changed (3)
  1. README.md +380 -0
  2. app.py +118 -0
  3. requirements.txt +15 -0
README.md CHANGED
@@ -12,3 +12,383 @@ short_description: Qwen2.5 Omni 3B ASR DEMO
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Qwen2.5-Omni ASR (ZeroGPU) Gradio App

A lightweight Gradio application that uses Qwen2.5-Omni's audio-to-text capabilities to perform automatic speech recognition (ASR) on uploaded audio files, then converts the Simplified Chinese output to Traditional Chinese. The project uses ZeroGPU for CPU/GPU offload acceleration, enabling efficient deployment on Hugging Face Spaces without a dedicated GPU.

---

## Overview

* **Model:** Qwen2.5-Omni-3B
* **Processor:** Qwen2.5-Omni processor (handles tokenization and chat-template formatting)
* **Audio/Video Preprocessing:** `qwen-omni-utils` (handles loading and resampling)
* **Simplified→Traditional Conversion:** `opencc`
* **Web UI:** Gradio v5 (Blocks API)
* **ZeroGPU:** Hugging Face's offload wrapper (`spaces` package) that transparently dispatches tensors between the CPU and an available GPU (if any)

When a user uploads an audio file and provides a (customizable) user prompt such as "Transcribe the attached audio to text with punctuation," the app builds exactly the chat messages that Qwen2.5-Omni expects (including a system prompt under the hood), runs inference via ZeroGPU, and returns only the ASR transcript, stripped of internal "system … user … assistant" markers and converted into Traditional Chinese.

---

## Features

1. **Audio-to-Text with Qwen2.5-Omni**

   * Uses the official Qwen2.5-Omni model (3B parameters) to generate a punctuated transcript from common audio formats (WAV, MP3, etc.).
2. **ZeroGPU Acceleration**

   * Automatically offloads model weights and activations between CPU and GPU, allowing low-resource deployment on Hugging Face Spaces without a dedicated full-size GPU.
3. **Simplified→Traditional Chinese Conversion**

   * Applies OpenCC ("s2t") to convert Simplified Chinese output into Traditional Chinese in a single step (see the snippet after this list).
4. **Clean Transcript Output**

   * Internal "system", "user", and "assistant" prefixes are stripped before display, so end users see only the actual ASR text.
5. **Gradio Blocks UI (v5)**

   * Simple two-column layout: upload your audio on the left, enter a prompt on the right, click Transcribe, and view the Traditional Chinese transcript below.

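In practice, feature 3 is a single OpenCC call, the same one `app.py` makes on the decoded transcript:

```python
from opencc import OpenCC

cc = OpenCC("s2t")              # Simplified -> Traditional converter, as used in app.py
print(cc.convert("简体中文转换"))  # -> 簡體中文轉換
```
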
---

## Demo

![App Screenshot](https://user-provide-your-own-screenshot-url) <!-- Optional: insert a screenshot link or remove this line -->

1. **Upload Audio**: Click "Browse" or drag & drop a WAV/MP3/… file.
2. **User Prompt**: By default, it is set to

   ```
   Transcribe the attached audio to text with punctuation.
   ```

   You can customize this if you want a different style of transcription (e.g., "Add speaker labels," "Transcribe and summarize," etc.).
3. **Transcribe**: Hit "Transcribe" (ZeroGPU handles device placement automatically).
4. **Output**: The Traditional Chinese transcript appears in the output textbox, cleaned of any system/user/assistant markers.

---

## Installation & Local Run

1. **Clone the Repository**

   ```bash
   git clone https://github.com/<your-username>/qwen2-omni-asr-zerogpu.git
   cd qwen2-omni-asr-zerogpu
   ```

2. **Create a Python Virtual Environment** (recommended)

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. **Install Dependencies**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

4. **Run the App Locally**

   ```bash
   python app.py
   ```

   * This starts a Gradio server on `http://127.0.0.1:7860/` by default.
   * ZeroGPU will automatically detect a CUDA device and fall back to CPU if none is found.

---

## Deployment on Hugging Face Spaces

1. Create a new Space on Hugging Face with the Gradio SDK.
2. For hardware, select **ZeroGPU** (the `@spaces.GPU` decorator only takes effect on ZeroGPU hardware; on plain CPU hardware the app still runs, just without GPU acceleration).
3. Push (or upload) the repository contents, including:

   * `app.py`
   * `requirements.txt`
   * Any other config files (e.g., `README.md` itself).
4. Spaces will install dependencies via `requirements.txt` and automatically launch `app.py` under ZeroGPU.
5. Visit your Space's URL to try it out.

*No explicit `Dockerfile` or server config is needed; ZeroGPU handles the backend. Just ensure `spaces` is in `requirements.txt`.*

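Spaces reads its configuration from the YAML front matter at the very top of `README.md` (see the configuration reference linked above). A minimal sketch; apart from `short_description`, which this repo's header already sets, the field values below are illustrative:

```yaml
---
title: Qwen2.5 Omni 3B ASR DEMO
sdk: gradio
app_file: app.py
short_description: Qwen2.5 Omni 3B ASR DEMO
---
```
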
---

## File Structure

```
├── app.py
├── requirements.txt
├── README.md
└── LICENSE (optional)
```

* **app.py**

  * Entry point for the Gradio app.
  * Defines `run_asr(...)` decorated with `@spaces.GPU` to enable ZeroGPU offload.
  * Loads the Qwen2.5-Omni model & processor, then runs audio preprocessing, inference, decoding, prompt stripping, and Simplified→Traditional conversion.
  * Builds the Gradio Blocks UI (two-column layout).

* **requirements.txt**

  ```text
  # ZeroGPU for CPU/GPU offload acceleration
  spaces

  # PyTorch + Transformers
  torch
  transformers

  # Qwen Omni utilities (for audio preprocessing)
  qwen-omni-utils

  # OpenCC (simplified→traditional conversion)
  opencc

  # Gradio v5
  gradio>=5.0.0
  ```

* **README.md**

  * (You're reading it.)

---

## How It Works

1. **Model & Processor Loading**

   ```python
   MODEL_ID = "Qwen/Qwen2.5-Omni-3B"
   model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
       MODEL_ID, torch_dtype="auto", device_map="auto"
   )
   model.disable_talker()
   processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
   model.eval()
   ```

   * `device_map="auto"` plus the `@spaces.GPU` (ZeroGPU) decorator ensure that, if a GPU is present, weights are offloaded to it; otherwise they stay on the CPU.
   * `disable_talker()` drops the speech-generation ("talker") head so the model is used purely for ASR.

2. **Message Construction for ASR**

   ```python
   sys_prompt = (
       "You are Qwen, a virtual human developed by the Qwen Team, "
       "Alibaba Group, capable of perceiving auditory and visual inputs, "
       "as well as generating text and speech."
   )
   messages = [
       {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
       {
           "role": "user",
           "content": [
               {"type": "audio", "audio": audio_path},
               {"type": "text", "text": user_prompt}
           ],
       },
   ]
   ```

   * This mirrors the Qwen chat template: first a system message, then a user message containing the uploaded audio file plus a textual instruction.

3. **Apply Chat Template & Preprocess**

   ```python
   text_input = processor.apply_chat_template(
       messages, tokenize=False, add_generation_prompt=True
   )
   audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
   inputs = processor(
       text=text_input,
       audio=audios,
       images=images,
       videos=videos,
       return_tensors="pt",
       padding=True,
       use_audio_in_video=True
   ).to(model.device).to(model.dtype)
   ```

   * `apply_chat_template(...)` formats the messages into a single input string.
   * `process_mm_info(...)` handles loading and resampling of audio (and extracting video frames, if video files are provided).
   * The resulting `inputs` dict of tensors is ready for `model.generate()`.

4. **Inference & Post-Processing**

   ```python
   output_tokens = model.generate(
       **inputs,
       use_audio_in_video=True,
       return_audio=False,
       thinker_max_new_tokens=512,
       thinker_do_sample=False
   )
   full_decoded = processor.batch_decode(
       output_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False
   )[0].strip()
   asr_only = _strip_prompts(full_decoded)
   return cc.convert(asr_only)
   ```

   * `model.generate(...)` runs greedy (no sampling) decoding for up to 512 new tokens.
   * `batch_decode(...)` yields a single string that still contains the "system … user … assistant" markers.
   * `_strip_prompts(...)` (reproduced below) finds the first occurrence of `assistant` in that output and returns only the substring after it, so the UI sees just the raw transcript.
   * Finally, `opencc` converts that transcript from Simplified to Traditional Chinese.

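For reference, the prompt-stripping helper as defined in `app.py`:

```python
def _strip_prompts(full_text: str) -> str:
    """
    Remove "system … user … assistant" from the decoded string
    so only the actual ASR transcript remains.
    """
    marker = "assistant"
    if marker in full_text:
        return full_text.split(marker, 1)[1].strip()
    else:
        return full_text.strip()
```
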
247
+ ---
248
+
249
+ ## Dependencies
250
+
251
+ All required dependencies are listed in `requirements.txt`. Briefly:
252
+
253
+ * **spaces**: Offload wrapper (ZeroGPU) to auto-dispatch tensors between CPU/GPU.
254
+ * **torch** & **transformers**: Core PyTorch framework and Hugging Face Transformers (to load Qwen2.5-Omni).
255
+ * **qwen-omni-utils**: Utility functions to preprocess audio/video for Qwen2.5-Omni.
256
+ * **opencc**: Simplified→Traditional Chinese converter (uses the “s2t” config).
257
+ * **gradio >= 5.0.0**: For building the web UI.
258
+
259
+ When you run `pip install -r requirements.txt`, all dependencies will be pulled from PyPI.
260
+
261
+ ---
262
+
263
+ ## Configuration
264
+
265
+ * **Model ID**
266
+
267
+ * Defined in `app.py` as `MODEL_ID = "Qwen/Qwen2.5-Omni-3B"`.
268
+ * If you want to try a smaller (or larger) Qwen2.5 model, simply update that string to another HF model repository (e.g., `"Qwen/Qwen2.5-Omni-1B"`), then re-deploy.
269
+
270
+ * **ZeroGPU Offload**
271
+
272
+ * The `@spaces.GPU` decorator on `run_asr(...)` is all you need to enable transparent offloading.
273
+ * No extra config or environment variables are required. Spaces will detect this, install `spaces`, and manage CPU/GPU placement.
274
+
275
+ * **Prompt Customization**
276
+
277
+ * By default, the textbox placeholder is
278
+
279
+ > “Transcribe the attached audio to text with punctuation.”
280
+ * You can customize this string directly in the Gradio component. If you omit the prompt entirely, `run_asr` will still run but may not add punctuation; it’s highly recommended to always provide a user prompt.
281
+
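The decorator can also reserve a longer ZeroGPU slot for long recordings. A minimal sketch, assuming the optional `duration` parameter (seconds per call) of the current `spaces` package; `run_asr_long` is just an illustrative name:

```python
import spaces

# Default: request a GPU slot only for the duration of each call.
@spaces.GPU
def run_asr(audio_path: str, user_prompt: str) -> str:
    ...

# Sketch: reserve up to 120 seconds per call for long audio files.
@spaces.GPU(duration=120)
def run_asr_long(audio_path: str, user_prompt: str) -> str:
    ...
```
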
282
+ ---
283
+
284
+ ## Project Structure
285
+
286
+ ```text
287
+ qwen2-omni-asr-zerogpu/
288
+ ├── app.py # Main application code (Gradio + inference logic)
289
+ ├── requirements.txt # All Python dependencies
290
+ ├── README.md # This file
291
+ └── LICENSE # (Optional) License, if you wish to open-source
292
+ ```
293
+
294
+ * **app.py**
295
+
296
+ * Imports: `spaces`, `torch`, `transformers`, `qwen_omni_utils`, `opencc`, `gradio`.
297
+ * Defines a helper `_strip_prompts()` to remove system/user/assistant markers.
298
+ * Implements `run_asr(...)` decorated with `@spaces.GPU`.
299
+ * Builds Gradio Blocks UI (with `gr.Row()`, `gr.Column()`, etc.).
300
+
301
+ * **requirements.txt**
302
+
303
+ * Must include exactly what’s needed to run on Spaces (and locally).
304
+ * ZeroGPU (the `spaces` package) should be first, so that Spaces’s auto-offload wrapper is installed.
305
+
306
+ ---
307
+
308
+ ## Usage Examples
309
+
310
+ 1. **Local Testing**
311
+
312
+ ```bash
313
+ python app.py
314
+ ```
315
+
316
+ * Open your browser to `http://127.0.0.1:7860/`
317
+ * Upload a short `.wav` or `.mp3` file (in Chinese) and click “Transcribe.”
318
+ * Verify that the output is properly punctuated, in Traditional Chinese, and free of system/user prefixes.
319
+
320
+ 2. **Command-Line Invocation**
321
+ Although the main interface is Gradio, you can also import `run_asr` directly in a Python shell to run a single file:
322
+
323
+ ```python
324
+ from app import run_asr
325
+
326
+ transcript = run_asr("path/to/audio.wav", "Transcribe the audio with punctuation.")
327
+ print(transcript) # → Traditional Chinese transcript
328
+ ```
329
+
330
+ 3. **Hugging Face Spaces**
331
+
332
+ * Ensure the repo is pushed to a Space (no special hardware required).
333
+ * The web UI will appear under your Space’s URL (e.g., `https://huggingface.co/spaces/your-username/qwen2-omni-asr-zerogpu`).
334
+ * End users simply upload audio and click “Transcribe.”
335
+
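Beyond the web UI, a deployed Space can also be called programmatically with `gradio_client`. A minimal sketch; the Space ID is a placeholder and the `api_name` is an assumption (Gradio derives it from the function name when none is set), so check the Space's "Use via API" page for the exact values:

```python
from gradio_client import Client, handle_file

# Placeholder Space ID (replace with your own username/space-name).
client = Client("your-username/qwen2-omni-asr-zerogpu")

result = client.predict(
    handle_file("path/to/audio.wav"),                            # audio input
    "Transcribe the attached audio to text with punctuation.",   # user prompt
    api_name="/run_asr",                                         # assumed default endpoint name
)
print(result)
```
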
336
+ ---
337
+
338
+ ## Troubleshooting
339
+
340
+ * **“Please upload an audio file first.”**
341
+
342
+ * This warning is returned if you click “Transcribe” without uploading a valid audio path.
343
+ * **Model-not-registered / FunASR Errors**
344
+
345
+ * If you see errors about “model not registered,” make sure you have the latest `qwen-omni-utils` version and check your internet connectivity (HF model downloads).
346
+ * **ZeroGPU Fallback**
347
+
348
+ * If no GPU is detected, ZeroGPU will automatically run inference on CPU. Performance will be slower, but functionality remains identical.
349
+ * **Output Contains “system … user … assistant”**
350
+
351
+ * If you still see system/user/assistant text, check that `_strip_prompts()` is present in `app.py` and is being applied to `full_decoded`.
352
+
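To confirm which device the model actually landed on, you can run a quick check from a Python shell (importing `app` loads the model, so expect a delay):

```python
import torch
from app import model  # app.py loads the model at import time

print("CUDA available:", torch.cuda.is_available())
print("Model device:", model.device)
print("Model dtype:", model.dtype)
```
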
---

## Contributing

1. **Fork the Repository**
2. **Create a New Branch**

   ```bash
   git checkout -b feature/my-enhancement
   ```
3. **Make Your Changes**

   * Improve the prompt-stripping logic, add new model IDs, or enhance the UI.
   * If you add new Python dependencies, remember to update `requirements.txt`.
4. **Test Locally**

   ```bash
   python app.py
   ```
5. **Push & Open a Pull Request**

   * Describe your changes in detail.
   * Update the README if new features are added.

---

## License

This project is open-source; choose whichever license you prefer (MIT, Apache 2.0, etc.). If no license file is provided, the default is "all rights reserved by the author."

---

## Acknowledgments

* **Qwen Team (Alibaba)** for the Qwen2.5-Omni model.
* **Hugging Face** for Transformers, Gradio, and the ZeroGPU infrastructure (`spaces` package).
* **OpenCC** for reliable Simplified→Traditional Chinese conversion.
* **qwen-omni-utils** for audio/video preprocessing helpers.

---

Thank you for trying out the Qwen2.5-Omni ASR (ZeroGPU) Gradio App! If you run into any issues or have suggestions, feel free to open an Issue or Pull Request on GitHub.
app.py ADDED
@@ -0,0 +1,118 @@
import spaces
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
from opencc import OpenCC
import gradio as gr

cc = OpenCC("s2t")

# Load model & processor once at startup
MODEL_ID = "Qwen/Qwen2.5-Omni-3B"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto"
)
model.disable_talker()
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
model.eval()


def _strip_prompts(full_text: str) -> str:
    """
    Remove "system … user … assistant" from the decoded string
    so only the actual ASR transcript remains.
    """
    marker = "assistant"
    if marker in full_text:
        return full_text.split(marker, 1)[1].strip()
    else:
        return full_text.strip()


@spaces.GPU
def run_asr(
    audio_path: str,
    user_prompt: str
) -> str:
    if not audio_path:
        return "⚠️ Please upload an audio file first."

    # 1) Build the chat messages expected by Qwen2.5-Omni
    sys_prompt = (
        "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
        "capable of perceiving auditory and visual inputs, as well as generating "
        "text and speech."
    )
    messages = [
        {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_path},
                {"type": "text", "text": user_prompt}
            ],
        },
    ]

    # 2) Apply chat template
    text_input = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # 3) Preprocess audio/video
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    # 4) Tokenize & move tensors
    inputs = processor(
        text=text_input,
        audio=audios,
        images=images,
        videos=videos,
        return_tensors="pt",
        padding=True,
        use_audio_in_video=True
    )
    inputs = inputs.to(model.device).to(model.dtype)

    # 5) Generate
    output_tokens = model.generate(
        **inputs,
        use_audio_in_video=True,
        return_audio=False,
        thinker_max_new_tokens=512,
        thinker_do_sample=False
    )

    # 6) Decode everything (system + user + assistant)
    full_decoded = processor.batch_decode(
        output_tokens,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0].strip()

    # 7) Strip off the "system … user … assistant" prefix
    asr_only = _strip_prompts(full_decoded)

    # 8) Convert to Traditional Chinese and return
    return cc.convert(asr_only)


with gr.Blocks() as demo:
    gr.Markdown("## Qwen2.5-Omni ASR → Audio to Punctuated Transcription (ZeroGPU)")

    with gr.Row():
        audio_input = gr.Audio(label="Upload Audio (WAV/MP3/…)", type="filepath")
        user_input = gr.Textbox(
            label="User Prompt",
            value="Transcribe the attached audio to text with punctuation."
        )

    submit_btn = gr.Button("Transcribe")
    output_txt = gr.Textbox(label="Transcription (Traditional Chinese)")

    submit_btn.click(
        fn=run_asr,
        inputs=[audio_input, user_input],
        outputs=output_txt
    )

if __name__ == "__main__":
    demo.queue()
    demo.launch()
requirements.txt ADDED
@@ -0,0 +1,15 @@
# ZeroGPU for CPU/GPU offload acceleration
spaces

# PyTorch + Transformers
torch
transformers

# Qwen Omni utilities (for audio preprocessing)
qwen-omni-utils

# OpenCC (for simplified→traditional conversion)
opencc

# Gradio v5
gradio>=5.0.0