Luigi committed on
Commit 920ed31 · 1 Parent(s): b429d7c

initial commit

Files changed (3)
  1. README.md +380 -0
  2. app.py +118 -0
  3. requirements.txt +15 -0
README.md CHANGED
@@ -12,3 +12,383 @@ short_description: Qwen2.5 Omni 3B ASR DEMO
---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Qwen2.5-Omni ASR (ZeroGPU) Gradio App

A lightweight Gradio application that uses Qwen2.5-Omni's audio-to-text capabilities to perform automatic speech recognition (ASR) on uploaded audio files, then converts the Simplified Chinese output to Traditional Chinese. The project uses ZeroGPU for CPU/GPU offload acceleration, enabling efficient deployment on Hugging Face Spaces without a dedicated GPU.

---

## Overview

* **Model:** Qwen2.5-Omni-3B
* **Processor:** Qwen2.5-Omni processor (handles tokenization and chat-template formatting)
* **Audio/Video Preprocessing:** `qwen-omni-utils` (handles loading and resampling)
* **Simplified→Traditional Conversion:** `opencc`
* **Web UI:** Gradio v5 (Blocks API)
* **ZeroGPU:** Hugging Face's offload wrapper (`spaces` package) that transparently dispatches tensors between the CPU and an available GPU (if any)

When a user uploads an audio file and provides a (customizable) user prompt such as "Transcribe the attached audio to text with punctuation," the app builds exactly the chat messages that Qwen2.5-Omni expects (including a system prompt under the hood), runs inference via ZeroGPU, and returns only the ASR transcript, stripped of internal "system … user … assistant" markers and converted into Traditional Chinese.

---

## Features

1. **Audio-to-Text with Qwen2.5-Omni**

   * Uses the official Qwen2.5-Omni model (3B parameters) to generate a punctuated transcript from common audio formats (WAV, MP3, etc.).
2. **ZeroGPU Acceleration**

   * Automatically offloads model weights and activations between CPU and GPU, allowing low-resource deployment on Hugging Face Spaces without a dedicated full-size GPU.
3. **Simplified→Traditional Chinese Conversion**

   * Applies OpenCC ("s2t") to convert Simplified Chinese output into Traditional Chinese in a single step (see the snippet after this list).
4. **Clean Transcript Output**

   * Internal "system", "user", and "assistant" prefixes are stripped before display, so end users see only the actual ASR text.
5. **Gradio Blocks UI (v5)**

   * Simple two-column layout: upload your audio on the left, enter a prompt on the right, click Transcribe, and view the Traditional Chinese transcript below.

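In practice, feature 3 is a single OpenCC call, the same one `app.py` makes on the decoded transcript:

```python
from opencc import OpenCC

cc = OpenCC("s2t")              # Simplified -> Traditional converter, as used in app.py
print(cc.convert("简体中文转换"))  # -> 簡體中文轉換
```
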
---

## Demo

![App Screenshot](https://user-provide-your-own-screenshot-url) <!-- Optional: insert a screenshot link or remove this line -->

1. **Upload Audio**: Click "Browse" or drag & drop a WAV/MP3/… file.
2. **User Prompt**: By default, it is set to

   ```
   Transcribe the attached audio to text with punctuation.
   ```

   You can customize this if you want a different style of transcription (e.g., "Add speaker labels," "Transcribe and summarize," etc.).
3. **Transcribe**: Hit "Transcribe" (ZeroGPU handles device placement automatically).
4. **Output**: The Traditional Chinese transcript appears in the output textbox, cleaned of any system/user/assistant markers.

---

## Installation & Local Run

1. **Clone the Repository**

   ```bash
   git clone https://github.com/<your-username>/qwen2-omni-asr-zerogpu.git
   cd qwen2-omni-asr-zerogpu
   ```

2. **Create a Python Virtual Environment** (recommended)

   ```bash
   python3 -m venv venv
   source venv/bin/activate
   ```

3. **Install Dependencies**

   ```bash
   pip install --upgrade pip
   pip install -r requirements.txt
   ```

4. **Run the App Locally**

   ```bash
   python app.py
   ```

   * This starts a Gradio server on `http://127.0.0.1:7860/` by default.
   * ZeroGPU will automatically detect a CUDA device and fall back to CPU if none is found.

---

## Deployment on Hugging Face Spaces

1. Create a new Space on Hugging Face with the Gradio SDK.
2. For hardware, select **ZeroGPU** (the `@spaces.GPU` decorator only takes effect on ZeroGPU hardware; on plain CPU hardware the app still runs, just without GPU acceleration).
3. Push (or upload) the repository contents, including:

   * `app.py`
   * `requirements.txt`
   * Any other config files (e.g., `README.md` itself).
4. Spaces will install dependencies via `requirements.txt` and automatically launch `app.py` under ZeroGPU.
5. Visit your Space's URL to try it out.

*No explicit `Dockerfile` or server config is needed; ZeroGPU handles the backend. Just ensure `spaces` is in `requirements.txt`.*

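Spaces reads its configuration from the YAML front matter at the very top of `README.md` (see the configuration reference linked above). A minimal sketch; apart from `short_description`, which this repo's header already sets, the field values below are illustrative:

```yaml
---
title: Qwen2.5 Omni 3B ASR DEMO
sdk: gradio
app_file: app.py
short_description: Qwen2.5 Omni 3B ASR DEMO
---
```
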
---

## File Structure

```
├── app.py
├── requirements.txt
├── README.md
└── LICENSE (optional)
```

* **app.py**

  * Entry point for the Gradio app.
  * Defines `run_asr(...)` decorated with `@spaces.GPU` to enable ZeroGPU offload.
  * Loads the Qwen2.5-Omni model & processor, then runs audio preprocessing, inference, decoding, prompt stripping, and Simplified→Traditional conversion.
  * Builds the Gradio Blocks UI (two-column layout).

* **requirements.txt**

  ```text
  # ZeroGPU for CPU/GPU offload acceleration
  spaces

  # PyTorch + Transformers
  torch
  transformers

  # Qwen Omni utilities (for audio preprocessing)
  qwen-omni-utils

  # OpenCC (simplified→traditional conversion)
  opencc

  # Gradio v5
  gradio>=5.0.0
  ```

* **README.md**

  * (You're reading it.)

---

## How It Works

1. **Model & Processor Loading**

   ```python
   MODEL_ID = "Qwen/Qwen2.5-Omni-3B"
   model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
       MODEL_ID, torch_dtype="auto", device_map="auto"
   )
   model.disable_talker()
   processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
   model.eval()
   ```

   * `device_map="auto"` plus the `@spaces.GPU` (ZeroGPU) decorator ensure that, if a GPU is present, weights are offloaded to it; otherwise they stay on the CPU.
   * `disable_talker()` drops the speech-generation ("talker") head so the model is used purely for ASR.

2. **Message Construction for ASR**

   ```python
   sys_prompt = (
       "You are Qwen, a virtual human developed by the Qwen Team, "
       "Alibaba Group, capable of perceiving auditory and visual inputs, "
       "as well as generating text and speech."
   )
   messages = [
       {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
       {
           "role": "user",
           "content": [
               {"type": "audio", "audio": audio_path},
               {"type": "text", "text": user_prompt}
           ],
       },
   ]
   ```

   * This mirrors the Qwen chat template: first a system message, then a user message containing the uploaded audio file plus a textual instruction.

3. **Apply Chat Template & Preprocess**

   ```python
   text_input = processor.apply_chat_template(
       messages, tokenize=False, add_generation_prompt=True
   )
   audios, images, videos = process_mm_info(messages, use_audio_in_video=True)
   inputs = processor(
       text=text_input,
       audio=audios,
       images=images,
       videos=videos,
       return_tensors="pt",
       padding=True,
       use_audio_in_video=True
   ).to(model.device).to(model.dtype)
   ```

   * `apply_chat_template(...)` formats the messages into a single input string.
   * `process_mm_info(...)` handles loading and resampling of audio (and extracting video frames, if video files are provided).
   * The resulting `inputs` dict of tensors is ready for `model.generate()`.

4. **Inference & Post-Processing**

   ```python
   output_tokens = model.generate(
       **inputs,
       use_audio_in_video=True,
       return_audio=False,
       thinker_max_new_tokens=512,
       thinker_do_sample=False
   )
   full_decoded = processor.batch_decode(
       output_tokens, skip_special_tokens=True, clean_up_tokenization_spaces=False
   )[0].strip()
   asr_only = _strip_prompts(full_decoded)
   return cc.convert(asr_only)
   ```

   * `model.generate(...)` runs greedy (no sampling) decoding for up to 512 new tokens.
   * `batch_decode(...)` yields a single string that still contains the "system … user … assistant" markers.
   * `_strip_prompts(...)` (reproduced below) finds the first occurrence of `assistant` in that output and returns only the substring after it, so the UI sees just the raw transcript.
   * Finally, `opencc` converts that transcript from Simplified to Traditional Chinese.

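For reference, the prompt-stripping helper as defined in `app.py`:

```python
def _strip_prompts(full_text: str) -> str:
    """
    Remove "system … user … assistant" from the decoded string
    so only the actual ASR transcript remains.
    """
    marker = "assistant"
    if marker in full_text:
        return full_text.split(marker, 1)[1].strip()
    else:
        return full_text.strip()
```
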
247
+ ---
248
+
249
+ ## Dependencies
250
+
251
+ All required dependencies are listed in `requirements.txt`. Briefly:
252
+
253
+ * **spaces**: Offload wrapper (ZeroGPU) to auto-dispatch tensors between CPU/GPU.
254
+ * **torch** & **transformers**: Core PyTorch framework and Hugging Face Transformers (to load Qwen2.5-Omni).
255
+ * **qwen-omni-utils**: Utility functions to preprocess audio/video for Qwen2.5-Omni.
256
+ * **opencc**: Simplified→Traditional Chinese converter (uses the “s2t” config).
257
+ * **gradio >= 5.0.0**: For building the web UI.
258
+
259
+ When you run `pip install -r requirements.txt`, all dependencies will be pulled from PyPI.
260
+
261
+ ---
262
+
263
+ ## Configuration
264
+
265
+ * **Model ID**
266
+
267
+ * Defined in `app.py` as `MODEL_ID = "Qwen/Qwen2.5-Omni-3B"`.
268
+ * If you want to try a smaller (or larger) Qwen2.5 model, simply update that string to another HF model repository (e.g., `"Qwen/Qwen2.5-Omni-1B"`), then re-deploy.
269
+
270
+ * **ZeroGPU Offload**
271
+
272
+ * The `@spaces.GPU` decorator on `run_asr(...)` is all you need to enable transparent offloading.
273
+ * No extra config or environment variables are required. Spaces will detect this, install `spaces`, and manage CPU/GPU placement.
274
+
275
+ * **Prompt Customization**
276
+
277
+ * By default, the textbox placeholder is
278
+
279
+ > “Transcribe the attached audio to text with punctuation.”
280
+ * You can customize this string directly in the Gradio component. If you omit the prompt entirely, `run_asr` will still run but may not add punctuation; it’s highly recommended to always provide a user prompt.
281
+
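The decorator can also reserve a longer ZeroGPU slot for long recordings. A minimal sketch, assuming the optional `duration` parameter (seconds per call) of the current `spaces` package; `run_asr_long` is just an illustrative name:

```python
import spaces

# Default: request a GPU slot only for the duration of each call.
@spaces.GPU
def run_asr(audio_path: str, user_prompt: str) -> str:
    ...

# Sketch: reserve up to 120 seconds per call for long audio files.
@spaces.GPU(duration=120)
def run_asr_long(audio_path: str, user_prompt: str) -> str:
    ...
```
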
282
+ ---
283
+
284
+ ## Project Structure
285
+
286
+ ```text
287
+ qwen2-omni-asr-zerogpu/
288
+ ├── app.py # Main application code (Gradio + inference logic)
289
+ ├── requirements.txt # All Python dependencies
290
+ ├── README.md # This file
291
+ └── LICENSE # (Optional) License, if you wish to open-source
292
+ ```
293
+
294
+ * **app.py**
295
+
296
+ * Imports: `spaces`, `torch`, `transformers`, `qwen_omni_utils`, `opencc`, `gradio`.
297
+ * Defines a helper `_strip_prompts()` to remove system/user/assistant markers.
298
+ * Implements `run_asr(...)` decorated with `@spaces.GPU`.
299
+ * Builds Gradio Blocks UI (with `gr.Row()`, `gr.Column()`, etc.).
300
+
301
+ * **requirements.txt**
302
+
303
+ * Must include exactly what’s needed to run on Spaces (and locally).
304
+ * ZeroGPU (the `spaces` package) should be first, so that Spaces’s auto-offload wrapper is installed.
305
+
306
+ ---
307
+
308
+ ## Usage Examples
309
+
310
+ 1. **Local Testing**
311
+
312
+ ```bash
313
+ python app.py
314
+ ```
315
+
316
+ * Open your browser to `http://127.0.0.1:7860/`
317
+ * Upload a short `.wav` or `.mp3` file (in Chinese) and click “Transcribe.”
318
+ * Verify that the output is properly punctuated, in Traditional Chinese, and free of system/user prefixes.
319
+
320
+ 2. **Command-Line Invocation**
321
+ Although the main interface is Gradio, you can also import `run_asr` directly in a Python shell to run a single file:
322
+
323
+ ```python
324
+ from app import run_asr
325
+
326
+ transcript = run_asr("path/to/audio.wav", "Transcribe the audio with punctuation.")
327
+ print(transcript) # → Traditional Chinese transcript
328
+ ```
329
+
330
+ 3. **Hugging Face Spaces**
331
+
332
+ * Ensure the repo is pushed to a Space (no special hardware required).
333
+ * The web UI will appear under your Space’s URL (e.g., `https://huggingface.co/spaces/your-username/qwen2-omni-asr-zerogpu`).
334
+ * End users simply upload audio and click “Transcribe.”
335
+
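Beyond the web UI, a deployed Space can also be called programmatically with `gradio_client`. A minimal sketch; the Space ID is a placeholder and the `api_name` is an assumption (Gradio derives it from the function name when none is set), so check the Space's "Use via API" page for the exact values:

```python
from gradio_client import Client, handle_file

# Placeholder Space ID (replace with your own username/space-name).
client = Client("your-username/qwen2-omni-asr-zerogpu")

result = client.predict(
    handle_file("path/to/audio.wav"),                            # audio input
    "Transcribe the attached audio to text with punctuation.",   # user prompt
    api_name="/run_asr",                                         # assumed default endpoint name
)
print(result)
```
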
336
+ ---
337
+
338
+ ## Troubleshooting
339
+
340
+ * **“Please upload an audio file first.”**
341
+
342
+ * This warning is returned if you click “Transcribe” without uploading a valid audio path.
343
+ * **Model-not-registered / FunASR Errors**
344
+
345
+ * If you see errors about “model not registered,” make sure you have the latest `qwen-omni-utils` version and check your internet connectivity (HF model downloads).
346
+ * **ZeroGPU Fallback**
347
+
348
+ * If no GPU is detected, ZeroGPU will automatically run inference on CPU. Performance will be slower, but functionality remains identical.
349
+ * **Output Contains “system … user … assistant”**
350
+
351
+ * If you still see system/user/assistant text, check that `_strip_prompts()` is present in `app.py` and is being applied to `full_decoded`.
352
+
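To confirm which device the model actually landed on, you can run a quick check from a Python shell (importing `app` loads the model, so expect a delay):

```python
import torch
from app import model  # app.py loads the model at import time

print("CUDA available:", torch.cuda.is_available())
print("Model device:", model.device)
print("Model dtype:", model.dtype)
```
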
---

## Contributing

1. **Fork the Repository**
2. **Create a New Branch**

   ```bash
   git checkout -b feature/my-enhancement
   ```
3. **Make Your Changes**

   * Improve the prompt-stripping logic, add new model IDs, or enhance the UI.
   * If you add new Python dependencies, remember to update `requirements.txt`.
4. **Test Locally**

   ```bash
   python app.py
   ```
5. **Push & Open a Pull Request**

   * Describe your changes in detail.
   * Update the README if new features are added.

---

## License

This project is open-source; choose whichever license you prefer (MIT, Apache 2.0, etc.). If no license file is provided, the default is "all rights reserved by the author."

---

## Acknowledgments

* **Qwen Team (Alibaba)** for the Qwen2.5-Omni model.
* **Hugging Face** for Transformers, Gradio, and the ZeroGPU infrastructure (`spaces` package).
* **OpenCC** for reliable Simplified→Traditional Chinese conversion.
* **qwen-omni-utils** for audio/video preprocessing helpers.

---

Thank you for trying out the Qwen2.5-Omni ASR (ZeroGPU) Gradio App! If you run into any issues or have suggestions, feel free to open an Issue or Pull Request on GitHub.
app.py ADDED
@@ -0,0 +1,118 @@
import spaces
import torch
from transformers import Qwen2_5OmniForConditionalGeneration, Qwen2_5OmniProcessor
from qwen_omni_utils import process_mm_info
from opencc import OpenCC
import gradio as gr

cc = OpenCC("s2t")

# Load model & processor once at startup
MODEL_ID = "Qwen/Qwen2.5-Omni-3B"
model = Qwen2_5OmniForConditionalGeneration.from_pretrained(
    MODEL_ID,
    torch_dtype="auto",
    device_map="auto"
)
model.disable_talker()
processor = Qwen2_5OmniProcessor.from_pretrained(MODEL_ID)
model.eval()


def _strip_prompts(full_text: str) -> str:
    """
    Remove "system … user … assistant" from the decoded string
    so only the actual ASR transcript remains.
    """
    marker = "assistant"
    if marker in full_text:
        return full_text.split(marker, 1)[1].strip()
    else:
        return full_text.strip()


@spaces.GPU
def run_asr(
    audio_path: str,
    user_prompt: str
) -> str:
    if not audio_path:
        return "⚠️ Please upload an audio file first."

    # 1) Build the chat messages expected by Qwen2.5-Omni
    sys_prompt = (
        "You are Qwen, a virtual human developed by the Qwen Team, Alibaba Group, "
        "capable of perceiving auditory and visual inputs, as well as generating "
        "text and speech."
    )
    messages = [
        {"role": "system", "content": [{"type": "text", "text": sys_prompt}]},
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio": audio_path},
                {"type": "text", "text": user_prompt}
            ],
        },
    ]

    # 2) Apply chat template
    text_input = processor.apply_chat_template(
        messages,
        tokenize=False,
        add_generation_prompt=True
    )

    # 3) Preprocess audio/video
    audios, images, videos = process_mm_info(messages, use_audio_in_video=True)

    # 4) Tokenize & move tensors
    inputs = processor(
        text=text_input,
        audio=audios,
        images=images,
        videos=videos,
        return_tensors="pt",
        padding=True,
        use_audio_in_video=True
    )
    inputs = inputs.to(model.device).to(model.dtype)

    # 5) Generate
    output_tokens = model.generate(
        **inputs,
        use_audio_in_video=True,
        return_audio=False,
        thinker_max_new_tokens=512,
        thinker_do_sample=False
    )

    # 6) Decode everything (system + user + assistant)
    full_decoded = processor.batch_decode(
        output_tokens,
        skip_special_tokens=True,
        clean_up_tokenization_spaces=False
    )[0].strip()

    # 7) Strip off the "system … user … assistant" prefix
    asr_only = _strip_prompts(full_decoded)

    # 8) Convert to Traditional Chinese and return
    return cc.convert(asr_only)


with gr.Blocks() as demo:
    gr.Markdown("## Qwen2.5-Omni ASR → Audio to Punctuated Transcription (ZeroGPU)")

    with gr.Row():
        audio_input = gr.Audio(label="Upload Audio (WAV/MP3/…)", type="filepath")
        user_input = gr.Textbox(
            label="User Prompt",
            value="Transcribe the attached audio to text with punctuation."
        )

    submit_btn = gr.Button("Transcribe")
    output_txt = gr.Textbox(label="Transcription (Traditional Chinese)")

    submit_btn.click(
        fn=run_asr,
        inputs=[audio_input, user_input],
        outputs=output_txt
    )

if __name__ == "__main__":
    demo.queue()
    demo.launch()
requirements.txt ADDED
@@ -0,0 +1,15 @@
# ZeroGPU for CPU/GPU offload acceleration
spaces

# PyTorch + Transformers
torch
transformers

# Qwen Omni utilities (for audio preprocessing)
qwen-omni-utils

# OpenCC (for simplified→traditional conversion)
opencc

# Gradio v5
gradio>=5.0.0