Commit · 39ec667
Parent(s): 92efdf7

Initial Commit

Clean deploy with binary files (assets)
Signed-off-by: abraarsyed <abraar.syed01@gmail.com>
- README.md +125 -11
- accent_detection_cli.py +67 -0
- app.py +2 -0
- core/__init__.py +0 -0
- core/__pycache__/__init__.cpython-312.pyc +0 -0
- core/__pycache__/logger.cpython-312.pyc +0 -0
- core/__pycache__/processor.cpython-312.pyc +0 -0
- core/logger.py +11 -0
- core/processor.py +82 -0
- requirements.txt +12 -0
- web/app.py +74 -0
README.md
CHANGED
@@ -1,14 +1,128 @@
(Removed: lines 1-14 of the previous README, a block delimited by `---` lines; its contents are not rendered in this view.)
# VocalPrint AI

VocalPrint AI is a CLI + web-based tool that detects spoken English accents, scores fluency, and transcribes speech from public video/audio sources.

---

## Features

- Detects common English accents:
  - Indian, American, British, Australian, and more
- Scores fluency based on actual speaking duration
- Transcribes speech using OpenAI's Whisper model
- Top-3 accent predictions with confidence values
- Supports YouTube, Loom, and direct MP4 links
- Web UI built using Gradio for fast testing
- CLI and Web UI use a shared processing core
- JSON output for easy API integration

---

## Technical Highlights

- **Models Used**:
  - Whisper (for transcription + language detection)
  - `dima806/english_accents_classification` (for accent prediction)

- **Audio Segment Handling**:
  - Only a 30-second segment is extracted from the middle of the video for analysis (to avoid intros and outros)

- **Transcript Handling**:
  - Only the first 500 characters of the transcript are returned to keep the result clean

- **Output**:
  - Returns JSON with detected accent, confidence %, top-3 predictions, fluency score, language code, and sample transcript

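The fluency score referenced above comes from Whisper's segment timestamps: it is the share of the analyzed clip (up to the end of the last spoken segment) that contains speech, expressed as a value from 0 to 100. A minimal sketch of the idea, mirroring `compute_fluency` in `core/processor.py`:

```python
def compute_fluency(segments):
    # segments: Whisper's list of {"start": ..., "end": ...} timestamps in seconds
    if not segments:
        return 0
    total_time = segments[-1]["end"]                              # end of the last spoken segment
    speaking_time = sum(s["end"] - s["start"] for s in segments)  # time actually spent speaking
    return int(min(speaking_time / total_time * 100, 100))

# Example: 24 s of speech across a 30 s window -> score of 80
print(compute_fluency([{"start": 0.0, "end": 12.0}, {"start": 18.0, "end": 30.0}]))
```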
---

## Project Structure

```
vocalprint-ai/
├── core/
│   ├── __init__.py
│   ├── processor.py            # shared logic used by both CLI and web
│   └── logger.py               # shared logger instance
├── accent_detection_cli.py     # CLI entrypoint
├── web/
│   └── app.py                  # Web UI via Gradio
├── requirements.txt
├── README.md
└── .gitignore
```

---

## Quick Start

### 1. Install dependencies

```bash
pip3 install -r requirements.txt
```

### 2. Run the CLI tool

```bash
python3 accent_detection_cli.py \
  --url "https://www.youtube.com/watch?v=W2Jzkl8J2nM" \
  --device cpu
```

### 3. Sample output

```json
{
  "accent": "canada",
  "accent_confidence": 86.0,
  "top_3_predictions": [
    {
      "accent": "canada",
      "confidence": 86.0
    },
    {
      "accent": "us",
      "confidence": 13.56
    },
    {
      "accent": "england",
      "confidence": 0.21
    }
  ],
  "fluency_score": 100,
  "language_detected_by_whisper": "en",
  "transcript_sample": " you're a mass of competing short term interests. And so the question is then, well, which short term interest should win out? And the answer to that is none of them. They need to be organized into a hierarchy that makes them functional across time and across individuals. So like a two year old is v"
}
```

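Because the CLI prints plain JSON (and can also write it to disk with `--output`), the result is easy to consume from other code. A minimal sketch, assuming the CLI was run with `--output result.json` (the file name is only an example):

```python
import json

# Load the JSON written by: python3 accent_detection_cli.py --url ... --output result.json
with open("result.json") as f:
    result = json.load(f)

print(result["accent"], result["accent_confidence"])      # e.g. "canada" 86.0
print("fluency:", result["fluency_score"])
for prediction in result["top_3_predictions"]:
    print(f'{prediction["accent"]}: {prediction["confidence"]}%')
```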
### 4. Run the Web UI

```bash
python3 web/app.py
```

Then open `http://localhost:7860` in your browser.

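For Hugging Face Spaces, the repository root `app.py` added in this commit wraps the same interface, so the UI can also be started with the equivalent two lines from a Python shell at the project root:

```python
# Equivalent of the root app.py entrypoint
from web.app import iface

iface.launch()  # Gradio serves on http://localhost:7860 by default
```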
---

## Example Outputs

### 🎤 Example 1 – Indian Accent
**URL:** [https://www.youtube.com/watch?v=BZ7v0wVrKDo](https://www.youtube.com/watch?v=BZ7v0wVrKDo)



### 🎤 Example 2 – Canadian Accent
**URL:** [https://www.youtube.com/watch?v=W2Jzkl8J2nM](https://www.youtube.com/watch?v=W2Jzkl8J2nM)



---

## Known Bottlenecks

- Whisper runs on CPU if no GPU is available, which can be slow (~20s on CPU)
- Video download and audio extraction depend on a stable network and FFmpeg
- Some accent misclassifications may occur for mixed/regional speakers
- The Web UI uses a 30-second middle segment, so long videos may not be fully analyzed

---
accent_detection_cli.py
ADDED
@@ -0,0 +1,67 @@
import argparse
import json
import os
import tempfile
import shutil

import whisper
import torch

from core.processor import (
    download_video,
    extract_audio,
    transcribe,
    classify_accent,
    compute_fluency
)
from core.logger import logger


def main():
    parser = argparse.ArgumentParser(description="Accent & Fluency Detection CLI")
    parser.add_argument('--url', required=True, help='Public video URL (YouTube, Loom, MP4)')
    parser.add_argument('--output', help='Output path for JSON result')
    parser.add_argument('--device', default='auto', choices=['auto', 'cpu', 'cuda'], help='Device to run Whisper')
    parser.add_argument('--keep', action='store_true', help='Keep temporary files')
    args = parser.parse_args()

    # Use CUDA only when explicitly requested and actually available; otherwise fall back to CPU
    whisper_device = 'cuda' if args.device == 'cuda' and torch.cuda.is_available() else 'cpu'
    whisper_model = whisper.load_model("small", device=whisper_device)

    # Work in a throwaway directory; removed in the finally block unless --keep is set
    temp_dir = tempfile.mkdtemp()
    audio_path = os.path.join(temp_dir, "audio.wav")

    try:
        download_video(args.url, temp_dir)
        video_file = next((f for f in os.listdir(temp_dir) if f.endswith(".mp4")), None)
        if not video_file:
            raise FileNotFoundError("No .mp4 file found in temp dir")
        extract_audio(os.path.join(temp_dir, video_file), audio_path)
        transcript, segments, language = transcribe(audio_path, whisper_model)
        top_accent, confidence, top3 = classify_accent(audio_path)
        fluency = compute_fluency(segments)

        result = {
            "accent": top_accent,
            "accent_confidence": confidence,
            "top_3_predictions": top3,
            "fluency_score": fluency,
            "language_detected_by_whisper": language,
            "transcript_sample": transcript[:300]
        }

        print(json.dumps(result, indent=2))
        if args.output:
            with open(args.output, 'w') as f:
                json.dump(result, f, indent=2)
            logger.info(f"Saved output to {args.output}")
    except Exception as e:
        logger.error(f"FAILED: {e}")
    finally:
        if args.keep:
            logger.info(f"Temporary files kept in: {temp_dir}")
        else:
            shutil.rmtree(temp_dir)


if __name__ == "__main__":
    main()
app.py
ADDED
@@ -0,0 +1,2 @@
from web.app import iface
iface.launch()
core/__init__.py
ADDED
File without changes
core/__pycache__/__init__.cpython-312.pyc
ADDED
Binary file (159 Bytes).
core/__pycache__/logger.cpython-312.pyc
ADDED
Binary file (732 Bytes).
core/__pycache__/processor.cpython-312.pyc
ADDED
Binary file (5.64 kB).
core/logger.py
ADDED
@@ -0,0 +1,11 @@
import logging

logger = logging.getLogger("VocalPrint")
logger.setLevel(logging.INFO)

# Only add handler if not already present (avoid duplicates on reload)
if not logger.hasHandlers():
    handler = logging.StreamHandler()
    formatter = logging.Formatter('%(asctime)s | %(levelname)s | %(message)s')
    handler.setFormatter(formatter)
    logger.addHandler(handler)
core/processor.py
ADDED
@@ -0,0 +1,82 @@
import os
import subprocess
import requests
import torch
import yt_dlp
import soundfile as sf

from transformers import Wav2Vec2ForSequenceClassification, Wav2Vec2FeatureExtractor
from scipy.special import softmax

from core.logger import logger


# Load the accent classifier and its feature extractor once at import time
MODEL_ID = "dima806/english_accents_classification"
accent_model = Wav2Vec2ForSequenceClassification.from_pretrained(MODEL_ID)
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_ID)
labels = list(accent_model.config.id2label.values())

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
accent_model.to(device)


def download_video(url, output_dir):
    logger.info(f"Downloading video from: {url}")
    try:
        if any(x in url for x in ["youtube.com", "youtu.be", "loom.com"]):
            ydl_opts = {
                'format': 'bestvideo+bestaudio/best',
                'merge_output_format': 'mp4',
                'outtmpl': os.path.join(output_dir, 'input_video.%(ext)s'),
                'quiet': True,
                'no_warnings': True
            }
            with yt_dlp.YoutubeDL(ydl_opts) as ydl:
                ydl.download([url])
        else:
            # Direct MP4 link: stream the file to disk in chunks
            response = requests.get(url, stream=True, timeout=20)
            response.raise_for_status()
            filepath = os.path.join(output_dir, "input_video.mp4")
            with open(filepath, 'wb') as f:
                for chunk in response.iter_content(chunk_size=8192):
                    f.write(chunk)
    except Exception as e:
        logger.error(f"Failed to download video: {e}")
        raise


def extract_audio(video_path, audio_path):
    logger.info("Extracting audio from video...")
    # Extract a 30-second window starting at 0:15, downmixed to 16 kHz mono
    subprocess.run([
        'ffmpeg', '-y', '-i', video_path,
        '-ss', '00:00:15', '-t', '00:00:30',
        '-ar', '16000', '-ac', '1',
        '-loglevel', 'error', audio_path
    ], check=True)


def transcribe(audio_path, whisper_model):
    logger.info("Transcribing with Whisper...")
    result = whisper_model.transcribe(audio_path)
    return result["text"], result["segments"], result["language"]


def classify_accent(audio_path):
    logger.info("Running accent classification...")
    waveform, sample_rate = sf.read(audio_path)
    inputs = feature_extractor(waveform, sampling_rate=sample_rate, return_tensors="pt", padding=True)
    inputs = {k: v.to(device) for k, v in inputs.items()}
    with torch.no_grad():
        logits = accent_model(**inputs).logits
    # Softmax over the logits gives per-accent probabilities; keep the top 3
    probs = softmax(logits[0].cpu().numpy())
    top_indices = probs.argsort()[::-1][:3]
    top_accents = [{"accent": labels[i], "confidence": round(float(probs[i]) * 100, 2)} for i in top_indices]
    return top_accents[0]["accent"], top_accents[0]["confidence"], top_accents


def compute_fluency(segments):
    # Fluency = share of the clip covered by spoken segments, capped at 100
    if not segments:
        return 0
    total_time = segments[-1]['end']
    speaking_time = sum(seg['end'] - seg['start'] for seg in segments)
    return int(min(speaking_time / total_time * 100, 100))
requirements.txt
ADDED
@@ -0,0 +1,12 @@
torch>=2.0.0
torchaudio>=2.0.0
transformers>=4.38.0
openai-whisper
ffmpeg-python
requests
yt-dlp
protobuf
soundfile
scipy
gradio>=4.0.0
librosa>=0.10.0
web/app.py
ADDED
@@ -0,0 +1,74 @@
# web/app.py
# Gradio-based web UI for VocalPrint AI (refactored to use shared CLI logic)

import gradio as gr
import os
import tempfile
import whisper
import torch
import json
import sys

# Ensure parent directory is in path
sys.path.append(os.path.abspath(os.path.join(os.path.dirname(__file__), '..')))

from core.processor import (
    download_video,
    extract_audio,
    transcribe,
    classify_accent,
    compute_fluency
)

# Load Whisper model once
whisper_model = whisper.load_model("small")

def process_video(url):
    try:
        temp_dir = tempfile.mkdtemp()
        video_path = os.path.join(temp_dir, "video.mp4")
        audio_path = os.path.join(temp_dir, "audio.wav")

        download_video(url, temp_dir)
        video_file = next((f for f in os.listdir(temp_dir) if f.endswith(".mp4")), None)
        if not video_file:
            raise FileNotFoundError("No .mp4 file found")

        extract_audio(os.path.join(temp_dir, video_file), audio_path)
        transcript, segments, language = transcribe(audio_path, whisper_model)
        top_accent, confidence, top3 = classify_accent(audio_path)
        fluency = compute_fluency(segments)

        # Format the top3 for the dataframe display
        top3_formatted = [[item["accent"], f"{item['confidence']}%"] for item in top3]

        return (
            top_accent,
            f"{confidence}%",
            fluency,
            language,
            transcript[:500],
            top3_formatted
        )
    except Exception as e:
        return ("Error", "-", "-", "-", str(e), [])

iface = gr.Interface(
    fn=process_video,
    inputs=gr.Textbox(label="Public Video URL (YouTube, Loom, MP4)", placeholder="https://..."),
    outputs=[
        gr.Textbox(label="Detected Accent"),
        gr.Textbox(label="Confidence (%)"),
        gr.Textbox(label="Fluency Score (0–100)"),
        gr.Textbox(label="Language Detected by Whisper"),
        gr.Textbox(label="Transcript Sample (first 500 chars)"),
        gr.Dataframe(headers=["Accent", "Confidence"], label="Top 3 Accent Predictions")
    ],
    title="VocalPrint AI",
    description="Analyze English speech from a public video link to detect accent, fluency, and transcription.",
    allow_flagging="never",
    theme="default"
)

if __name__ == "__main__":
    iface.launch()