abraarsyed committed
Commit 39ec667 · 1 Parent(s): 92efdf7

Initial Commit


Clean deploy with binary files (assets)

Signed-off-by: abraarsyed <abraar.syed01@gmail.com>

README.md CHANGED
@@ -1,14 +1,128 @@
 ---
- title: Vocalprint Ai
- emoji: 📊
- colorFrom: indigo
- colorTo: red
- sdk: gradio
- sdk_version: 5.31.0
- app_file: app.py
- pinned: false
- license: mit
- short_description: Detect english accent/fluency from video links - YT/Loom/mp4
 ---
 
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # VocalPrint AI
+ 
+ VocalPrint AI is a CLI and web-based tool that detects spoken English accents, scores fluency, and transcribes speech from public video/audio sources.
+ 
+ ---
+ 
+ ## Features
+ 
+ - Detects common English accents:
+   - Indian, American, British, Australian, and more
+ - Scores fluency based on actual speaking duration (sketched under Technical Highlights below)
+ - Transcribes speech using OpenAI's Whisper model
+ - Top-3 accent predictions with confidence values
+ - Supports YouTube, Loom, and direct MP4 links
+ - Web UI built using Gradio for fast testing
+ - CLI and Web UI use a shared processing core
+ - JSON output for easy API integration
+ 
+ ---
+ 
+ ## Technical Highlights
+ 
+ - **Models Used**:
+   - Whisper (for transcription + language detection)
+   - `dima806/english_accents_classification` (for accent prediction)
+ 
+ - **Audio Segment Handling**:
+   - Only a 30-second segment, starting 15 seconds into the video, is extracted for analysis (to skip intros)
+ 
+ - **Transcript Handling**:
+   - Only the first few hundred characters of the transcript are returned (500 in the Web UI, 300 in the CLI) to keep the result clean
+ 
+ - **Output**:
+   - Returns JSON with detected accent, confidence %, top-3 predictions, fluency score, language code, and sample transcript
+ 
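+ The fluency score comes straight from Whisper's segment timestamps: it is the share of the analyzed clip that is covered by speech, capped at 100 (see `compute_fluency` in `core/processor.py`). Roughly:
+ 
+ ```python
+ # Sketch of the fluency heuristic implemented in core/processor.py
+ speaking_time = sum(seg["end"] - seg["start"] for seg in segments)   # seconds of detected speech
+ total_time = segments[-1]["end"]                                     # end of the last speech segment
+ fluency_score = int(min(speaking_time / total_time * 100, 100))
+ ```
+ 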
 ---
+ 
+ ## Project Structure
+ 
+ ```
+ vocalprint-ai/
+ ├── core/
+ │   ├── __init__.py
+ │   ├── processor.py            # shared logic used by both CLI and web
+ │   └── logger.py               # shared logger instance
+ ├── accent_detection_cli.py     # CLI entrypoint
+ ├── web/
+ │   └── app.py                  # Web UI via Gradio
+ ├── requirements.txt
+ ├── README.md
+ └── .gitignore
+ ```
+ 
 ---
 
+ ## Quick Start
+ 
+ ### 1. Install dependencies
+ 
+ ```bash
+ pip3 install -r requirements.txt
+ ```
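+ 
+ FFmpeg must also be installed and on your `PATH`, since it is invoked directly for audio extraction and used by `yt-dlp` to merge streams. For example:
+ 
+ ```bash
+ # Debian/Ubuntu
+ sudo apt-get install -y ffmpeg
+ # macOS (Homebrew)
+ brew install ffmpeg
+ ```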
+ 
+ ### 2. Run the CLI tool
+ 
+ ```bash
+ python3 accent_detection_cli.py \
+     --url "https://www.youtube.com/watch?v=W2Jzkl8J2nM" \
+     --device cpu
+ ```
+ 
+ ### 3. Sample output
+ 
+ ```json
+ {
+   "accent": "canada",
+   "accent_confidence": 86.0,
+   "top_3_predictions": [
+     {
+       "accent": "canada",
+       "confidence": 86.0
+     },
+     {
+       "accent": "us",
+       "confidence": 13.56
+     },
+     {
+       "accent": "england",
+       "confidence": 0.21
+     }
+   ],
+   "fluency_score": 100,
+   "language_detected_by_whisper": "en",
+   "transcript_sample": " you're a mass of competing short term interests. And so the question is then, well, which short term interest should win out? And the answer to that is none of them. They need to be organized into a hierarchy that makes them functional across time and across individuals. So like a two year old is v"
+ }
+ ```
+ 
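+ To also save the JSON to a file, pass `--output`; `--keep` retains the downloaded temporary files for debugging (both flags are defined in `accent_detection_cli.py`). For example:
+ 
+ ```bash
+ python3 accent_detection_cli.py \
+     --url "https://www.youtube.com/watch?v=W2Jzkl8J2nM" \
+     --device cpu \
+     --output result.json \
+     --keep
+ ```
+ 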
+ ### 4. Run the Web UI
+ 
+ ```bash
+ python3 web/app.py
+ ```
+ 
+ Then open `http://localhost:7860` in your browser.
+ 
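+ ### 5. (Optional) Use the shared core from Python
+ 
+ Because the CLI and the Web UI both call into `core/processor.py`, the same pipeline can be driven from your own Python code. A minimal sketch, assuming FFmpeg is installed and the download produces an `.mp4` as in the CLI:
+ 
+ ```python
+ import os
+ import tempfile
+ 
+ import whisper
+ 
+ from core.processor import (
+     download_video, extract_audio, transcribe, classify_accent, compute_fluency
+ )
+ 
+ tmp = tempfile.mkdtemp()
+ download_video("https://www.youtube.com/watch?v=W2Jzkl8J2nM", tmp)   # saves input_video.mp4
+ video = next(f for f in os.listdir(tmp) if f.endswith(".mp4"))
+ audio = os.path.join(tmp, "audio.wav")
+ extract_audio(os.path.join(tmp, video), audio)                       # 30 s, 16 kHz mono clip
+ 
+ model = whisper.load_model("small")
+ text, segments, language = transcribe(audio, model)
+ accent, confidence, top3 = classify_accent(audio)
+ print(accent, confidence, compute_fluency(segments))
+ ```
+ 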
+ ---
+ 
+ ## Example Outputs
+ 
+ ### 🎤 Example 1 – Indian Accent
+ **URL:** [https://www.youtube.com/watch?v=BZ7v0wVrKDo](https://www.youtube.com/watch?v=BZ7v0wVrKDo)
+ 
+ ![Indian Accent Example](assets/indian-accent.png)
+ 
+ ### 🎤 Example 2 – Canadian Accent
+ **URL:** [https://www.youtube.com/watch?v=W2Jzkl8J2nM](https://www.youtube.com/watch?v=W2Jzkl8J2nM)
+ 
+ ![Canadian Accent Example](assets/canadian-accent.png)
+ 
+ ---
+ 
+ ## Known Bottlenecks
+ 
+ - Whisper runs on CPU if no GPU is available, which can be slow (~20 s per clip on CPU)
+ - Video download and audio extraction depend on a stable network connection and a working FFmpeg install
+ - Some accent misclassifications may occur for mixed or regional speakers
+ - Both the CLI and the Web UI analyze only a single 30-second segment, so long videos are never fully analyzed
+ 
+ ---
accent_detection_cli.py ADDED
@@ -0,0 +1,67 @@
+ import argparse
+ import json
+ import os
+ import tempfile
+ import shutil
+ 
+ import whisper
+ import torch
+ 
+ from core.processor import (
+     download_video,
+     extract_audio,
+     transcribe,
+     classify_accent,
+     compute_fluency
+ )
+ from core.logger import logger
+ 
+ 
+ def main():
+     parser = argparse.ArgumentParser(description="Accent & Fluency Detection CLI")
+     parser.add_argument('--url', required=True, help='Public video URL (YouTube, Loom, MP4)')
+     parser.add_argument('--output', help='Output path for JSON result')
+     parser.add_argument('--device', default='auto', choices=['auto', 'cpu', 'cuda'], help='Device to run Whisper')
+     parser.add_argument('--keep', action='store_true', help='Keep temporary files')
+     args = parser.parse_args()
+ 
+     # 'auto' uses CUDA when available; an explicit 'cuda' still falls back to CPU if none is present.
+     whisper_device = 'cuda' if args.device in ('auto', 'cuda') and torch.cuda.is_available() else 'cpu'
+     whisper_model = whisper.load_model("small", device=whisper_device)
+ 
+     temp_dir = tempfile.mkdtemp()
+     audio_path = os.path.join(temp_dir, "audio.wav")
+ 
+     try:
+         download_video(args.url, temp_dir)
+         video_file = next((f for f in os.listdir(temp_dir) if f.endswith(".mp4")), None)
+         if not video_file:
+             raise FileNotFoundError("No .mp4 file found in temp dir")
+         extract_audio(os.path.join(temp_dir, video_file), audio_path)
+         transcript, segments, language = transcribe(audio_path, whisper_model)
+         top_accent, confidence, top3 = classify_accent(audio_path)
+         fluency = compute_fluency(segments)
+ 
+         result = {
+             "accent": top_accent,
+             "accent_confidence": confidence,
+             "top_3_predictions": top3,
+             "fluency_score": fluency,
+             "language_detected_by_whisper": language,
+             "transcript_sample": transcript[:300]
+         }
+ 
+         print(json.dumps(result, indent=2))
+         if args.output:
+             with open(args.output, 'w') as f:
+                 json.dump(result, f, indent=2)
+             logger.info(f"Saved output to {args.output}")
+     except Exception as e:
+         logger.error(f"FAILED: {e}")
+     finally:
+         if args.keep:
+             logger.info(f"Temporary files kept in: {temp_dir}")
+         else:
+             shutil.rmtree(temp_dir)
+ 
+ if __name__ == "__main__":
+     main()
app.py ADDED
@@ -0,0 +1,2 @@
+ from web.app import iface
+ iface.launch()
core/__init__.py ADDED
File without changes
core/__pycache__/__init__.cpython-312.pyc ADDED
Binary file (159 Bytes)
core/__pycache__/logger.cpython-312.pyc ADDED
Binary file (732 Bytes)
core/__pycache__/processor.cpython-312.pyc ADDED
Binary file (5.64 kB)
core/logger.py ADDED
@@ -0,0 +1,11 @@
+ import logging
+ 
+ logger = logging.getLogger("VocalPrint")
+ logger.setLevel(logging.INFO)
+ 
+ # Only add handler if not already present (avoid duplicates on reload)
+ if not logger.hasHandlers():
+     handler = logging.StreamHandler()
+     formatter = logging.Formatter('%(asctime)s | %(levelname)s | %(message)s')
+     handler.setFormatter(formatter)
+     logger.addHandler(handler)
core/processor.py ADDED
@@ -0,0 +1,82 @@
+ import os
+ import subprocess
+ import requests
+ import torch
+ import yt_dlp
+ import soundfile as sf
+ 
+ from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
+ from scipy.special import softmax
+ 
+ from core.logger import logger
+ 
+ 
+ MODEL_ID = "dima806/english_accents_classification"
+ accent_model = Wav2Vec2ForSequenceClassification.from_pretrained(MODEL_ID)
+ feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
+ labels = list(accent_model.config.id2label.values())
+ 
+ device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
+ accent_model.to(device)
+ 
+ 
+ def download_video(url, output_dir):
+     logger.info(f"Downloading video from: {url}")
+     try:
+         if any(x in url for x in ["youtube.com", "youtu.be", "loom.com"]):
+             ydl_opts = {
+                 'format': 'bestvideo+bestaudio/best',
+                 'merge_output_format': 'mp4',
+                 'outtmpl': os.path.join(output_dir, 'input_video.%(ext)s'),
+                 'quiet': True,
+                 'no_warnings': True
+             }
+             with yt_dlp.YoutubeDL(ydl_opts) as ydl:
+                 ydl.download([url])
+         else:
+             # Direct MP4 link: stream the file to disk in chunks
+             response = requests.get(url, stream=True, timeout=20)
+             response.raise_for_status()
+             filepath = os.path.join(output_dir, "input_video.mp4")
+             with open(filepath, 'wb') as f:
+                 for chunk in response.iter_content(chunk_size=8192):
+                     f.write(chunk)
+     except Exception as e:
+         logger.error(f"Failed to download video: {e}")
+         raise
+ 
+ 
+ def extract_audio(video_path, audio_path):
+     # Grab a 30-second clip starting 15 s in, downmixed to 16 kHz mono for the models
+     logger.info("Extracting audio from video...")
+     subprocess.run([
+         'ffmpeg', '-y', '-i', video_path,
+         '-ss', '00:00:15', '-t', '00:00:30',
+         '-ar', '16000', '-ac', '1',
+         '-loglevel', 'error', audio_path
+     ], check=True)
+ 
+ 
+ def transcribe(audio_path, whisper_model):
+     logger.info("Transcribing with Whisper...")
+     result = whisper_model.transcribe(audio_path)
+     return result["text"], result["segments"], result["language"]
+ 
+ 
+ def classify_accent(audio_path):
+     logger.info("Running accent classification...")
+     waveform, sample_rate = sf.read(audio_path)
+     inputs = feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt", padding=True)
+     inputs = {k: v.to(device) for k, v in inputs.items()}
+     with torch.no_grad():
+         logits = accent_model(**inputs).logits
+     probs = softmax(logits[0].cpu().numpy())
+     top_indices = probs.argsort()[::-1][:3]
+     top_accents = [{"accent": labels[i], "confidence": round(float(probs[i]) * 100, 2)} for i in top_indices]
+     return top_accents[0]["accent"], top_accents[0]["confidence"], top_accents
+ 
+ 
+ def compute_fluency(segments):
+     # Fluency = % of the analyzed clip covered by speech segments, capped at 100
+     if not segments:
+         return 0
+     total_time = segments[-1]['end']
+     if total_time <= 0:
+         return 0
+     speaking_time = sum(seg['end'] - seg['start'] for seg in segments)
+     return int(min(speaking_time / total_time * 100, 100))
requirements.txt ADDED
@@ -0,0 +1,12 @@
+ torch>=2.0.0
+ torchaudio>=2.0.0
+ transformers>=4.38.0
+ openai-whisper
+ ffmpeg-python
+ requests
+ yt-dlp
+ protobuf
+ soundfile
+ scipy
+ gradio>=4.0.0
+ librosa>=0.10.0
web/app.py ADDED
@@ -0,0 +1,74 @@
+ # web/app.py
+ # Gradio-based web UI for VocalPrint AI (refactored to use shared CLI logic)
+ 
+ import gradio as gr
+ import os
+ import shutil
+ import tempfile
+ import whisper
+ import torch
+ import json
+ import sys
+ 
+ # Ensure parent directory is in path
+ sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))
+ 
+ from core.processor import (
+     download_video,
+     extract_audio,
+     transcribe,
+     classify_accent,
+     compute_fluency
+ )
+ 
+ # Load Whisper model once
+ whisper_model = whisper.load_model("small")
+ 
+ 
+ def process_video(url):
+     temp_dir = tempfile.mkdtemp()
+     audio_path = os.path.join(temp_dir, "audio.wav")
+     try:
+         download_video(url, temp_dir)
+         video_file = next((f for f in os.listdir(temp_dir) if f.endswith(".mp4")), None)
+         if not video_file:
+             raise FileNotFoundError("No .mp4 file found")
+ 
+         extract_audio(os.path.join(temp_dir, video_file), audio_path)
+         transcript, segments, language = transcribe(audio_path, whisper_model)
+         top_accent, confidence, top3 = classify_accent(audio_path)
+         fluency = compute_fluency(segments)
+ 
+         # Format the top3 for the dataframe display
+         top3_formatted = [[item["accent"], f"{item['confidence']}%"] for item in top3]
+ 
+         return (
+             top_accent,
+             f"{confidence}%",
+             fluency,
+             language,
+             transcript[:500],
+             top3_formatted
+         )
+     except Exception as e:
+         return ("Error", "-", "-", "-", str(e), [])
+     finally:
+         # Clean up downloaded video/audio so temp files don't accumulate between requests
+         shutil.rmtree(temp_dir, ignore_errors=True)
+ 
+ 
+ iface = gr.Interface(
+     fn=process_video,
+     inputs=gr.Textbox(label="Public Video URL (YouTube, Loom, MP4)", placeholder="https://..."),
+     outputs=[
+         gr.Textbox(label="Detected Accent"),
+         gr.Textbox(label="Confidence (%)"),
+         gr.Textbox(label="Fluency Score (0–100)"),
+         gr.Textbox(label="Language Detected by Whisper"),
+         gr.Textbox(label="Transcript Sample (first 500 chars)"),
+         gr.Dataframe(headers=["Accent", "Confidence"], label="Top 3 Accent Predictions")
+     ],
+     title="VocalPrint AI",
+     description="Analyze English speech from a public video link to detect accent, fluency, and transcription.",
+     allow_flagging="never",
+     theme="default"
+ )
+ 
+ if __name__ == "__main__":
+     iface.launch()