Spaces:
Running
A newer version of the Gradio SDK is available:
5.38.0
title: Contextual Video Data Server (MCP Tool/Server) - The Ultimate Video Whisperer!
emoji: ๐
colorFrom: indigo
colorTo: purple
sdk: gradio
sdk_version: 4.25.0
app_file: app.py
pinned: false
license: mit
short_description: Video analysis (transcription, caption, actions) for LLMs.
tags:
- mcp-server-track
๐ฌ Contextual Video Data Server (MCP Tool/Server) - The Ultimate Video Whisperer! ๐
Welcome to the Contextual Video Data MCP Server! This isn't just any project; it's our glorious entry into Track 1 ("MCP Tool / Server") of the Agents-MCP-Hackathon! ๐ Our mission, should we choose to accept it (and we totally do!), is to build an MCP (Model Context Protocol) server that acts like a super-smart assistant for Large Language Models (LLMs), feeding them rich, juicy, contextual information extracted directly from videos. Think of it as giving LLMs eyes and ears for the video world! ๐๐
Our grand vision? To take any video you throw at us (URLs, direct uploads, maybe even a carrier pigeon with a USB stick ๐๏ธ... okay, maybe not the pigeon) and dissect it. We're talking:
- Flawless audio transcriptions (what's being said).
- And the really exciting part: comprehensive visual interpretations (what's being seen and done)! This includes:
- Video Captioning: A snappy summary of the video's content.
- Action Recognition: Identifying all the cool (or mundane) actions happening.
- Object Detection/Tracking: Pinpointing and following key objects through the frames.
All this delicious data will be neatly packaged and served up via an MCP-compliant API endpoint, ready for LLMs to gobble up and become even more insightful. Let's make those AI brains BIGGER! ๐ง ๐ฅ
๐ Hackathon Context (Track 1: MCP Tool / Server) - Our Quest!
This project is laser-focused on conquering Track 1 of the Agents-MCP-Hackathon. Here's how we're hitting all the right notes ๐ถ:
- MCP Server/Tool Extraordinaire: We're building a Gradio application, destined to live on Hugging Face Spaces, that proudly stands as an MCP server. It's the digital butler for video context!
- Supercharging LLMs: Our server's raison d'รชtre is to provide video-derived context, empowering LLMs to deliver responses that are not just smart, but video-smart.
- Showtime! The Demo Requirement: We know talk is cheap. That's why a crucial part of our submission will be a dazzling video demonstration (linked right here in this README, eventually!). It'll showcase our MCP server strutting its stuff with an external MCP Client (like Claude Desktop, Cursor, Tiny Agents, or even another cool Gradio app). You'll see the magic unfold: video in โก๏ธ context out โก๏ธ happy LLM. ๐ช
Our Demo Video: https://www.loom.com/share/0ea7a160a3b240a399344d25b1d23a05?sid=28891b9f-7290-469f-b193-6c006277658c
๐๏ธ Project Architecture - The Three Musketeers of Video Context!
Our system isn't just thrown together; it's a masterpiece of engineering (if we do say so ourselves ๐). We've got a refined three-tier architecture, ensuring everyone plays their part perfectly:
Gradio App (This Hugging Face Space - Our MCP Server HQ ๐ฐ):
- The welcoming face of our operation! Handles video uploads (URLs or your precious files).
- The grand conductor ๐ป, orchestrating the video processing symphony by calling in our Modal backend.
- The meticulous librarian ๐, structuring all the extracted goodies (transcriptions, and soon, a smorgasbord of visual data).
- The ever-ready API provider, serving up context via an MCP-compliant endpoint (think
gr.JSON()
or a slick FastAPI route) to any LLM frontend that comes knocking. - Super Important Note: This Gradio app is a dedicated MCP server. No direct LLM chit-chat here! It's all about processing videos and serving data, keeping things clean and focused. โจ
Modal Backend (The Heavy Lifter ๐ช - Our Digital Hercules!):
- This is where the real grunt work happens. Got a computationally intensive task? Modal's on it!
- Currently wrestling with audio extraction and Whisper transcriptions using behemoth models like
openai/whisper-large-v3
. - Gearing up to tackle our Triple Threat Video Analysisโข: captioning, action recognition, and object detection/tracking. It's gonna be epic!
- Summoned by the Gradio App, it delivers efficiency and scalability like a champ. ๐ฅ
Another Hugging Face Space (The LLM's Frontend Friend ๐ค - Not Our Circus, Not Our Monkeys for this Task):
- Imagine this as an external buddy project โ the cool app where end-users chat with an LLM (Claude, Llama, you name it).
- This buddy will call our Gradio MCP Server to get the video lowdown.
- Then, armed with our context, it'll help the LLM craft super-duper responses.
- This separation of powers is key! It keeps our Contextual Video Data Server lean, mean, and a top-notch MCP server, just what the hackathon ordered.
โจ Features - What Makes Us Sparkle!
- ๐ฅ Versatile Video Ingestion: Handles YouTube links (thanks,
yt-dlp
!) and direct file uploads with grace. - ๐ค Crystal-Clear Transcriptions: Leverages the mighty Whisper models on Modal for top-tier audio-to-text conversion.
- ๐ค MCP-Compliant API: Serves up structured JSON data (transcriptions now, a feast of video analysis soon!) via a well-defined API endpoint.
- ๐ฅ๏ธ User-Friendly Gradio UI: A simple, intuitive interface for uploading videos and seeing the magic happen (locally for now, soon on HF Spaces!).
- ๐ THE BIG ONE (PLANNED!): Parallel Multi-Modal Video Interpretation! ๐
- ๐ผ๏ธ Video Captioning: What's the gist of this video?
- ๐ Action Recognition: Who's doing what? (Running? Jumping? Contemplating the universe?)
- ๐ Object Detection/Tracking: What's in the scene, and where is it going? (Is that a cat or a very fluffy loaf of bread? ๐)
- All processed in parallel (because we're ambitious like that!) and presented in an LLM-friendly format.
๐บ๏ธ Our Epic Development Journey & Discoveries So Far! (The Chronicles of Context ๐)
This hasn't been just coding; it's been an adventure! Here are some highlights from our quest for video understanding:
- The Foundation: We started by bravely integrating
yt-dlp
(for taming wild YouTube videos) andmoviepy
(our trusty audio-extracting squire). - The Heart of Transcription: We summoned the power of Whisper on Modal, using the Hugging Face
transformers
pipeline as our spellbook. แบฟm - The Ascent of Models (A Tale of Quality):
- Our quest began humbly with
openai/whisper-base
. - We then climbed the ladder to
openai/whisper-small
,openai/whisper-medium
, and have now reached the peak (for now!) withopenai/whisper-large-v3
, all in pursuit of transcription perfection! ๐๏ธ
- Our quest began humbly with
- Taming the Parameters: We dueled with
temperature
andno_repeat_ngram_size
(our secret weapons ingenerate_kwargs
) to banish repetitive demons and ensure coherent narratives. - Speaking the Right Language: We wisely added
language="en"
togenerate_kwargs
, ensuring Whisper knew what to expect (no more accidental Klingon transcriptions!). - Architectural Epiphanies: Like master builders, we refined our design to the glorious three-tier architecture you see today โ a beacon of clarity and scalability, fit for MCP royalty. ๐
- Slaying Dragons (aka Debugging): We've vanquished pesky bugs, from tricky parameter passing to the Hugging Face pipeline to ensuring our Modal environment's dependencies were as harmonious as a barbershop quartet. barbershop
๐ Project Structure - Know Your Way Around!
app.py
: The command center! Our main Gradio application (MCP Server), handling user interactions and the all-important API endpoint.modal_whisper_app.py
: The engine room! Defines the Modal app and the functions that do the heavy lifting (transcribe_video_audio
and its future video-analyzing siblings).requirements.txt
: The shopping list for our localapp.py
's Python dependencies.README.md
: You're looking at it! Our project's story, map, and instruction manual, all rolled into one. ๐บ๏ธ
๐ ๏ธ Setup - Let's Get This Party Started!
Prerequisites - The Essentials Before the Magic!
- Python 3.10+ (because we like our Python fresh! ๐)
- A Modal account, with the CLI installed and configured (
pip install modal-client
, thenmodal setup
). This is your key to the Modal kingdom! ๐ ffmpeg
installed locally. This digital Swiss Army knife is crucial foryt-dlp
andmoviepy
.- Debian/Ubuntu:
sudo apt update && sudo apt install ffmpeg
- macOS (Homebrew):
brew install ffmpeg
- Debian/Ubuntu:
Local Setup - Your Very Own Context Server!
Clone the Treasure Chest (Our Repository)!
# Make sure this URL points to our magnificent repo! git clone https://github.com/jomasego/video_mcp.git cd video_mcp
Install the Local Spells (Dependencies)!
pip install -r requirements.txt
Unleash the Modal Beast (Deploy the Function)! Make sure your Modal CLI is logged in and ready to rumble!
modal deploy modal_whisper_app.py
Running the Local Application (Our MCP Server in Action!)
- Ignite the Gradio App!
python3 app.py
- Point your trusty web browser to the URL Gradio provides (usually a friendly
http://127.0.0.1:7860
). - Feed it videos! Watch it work! Marvel at its (soon-to-be-even-more-awesome) power! ๐คฉ
๐ฎ Modal Function Details - The Wizardry Behind the Curtain!
The modal_whisper_app.py
script is where the real enchantment happens. It defines Modal functions that:
- Live in a custom Docker image, armed with
ffmpeg
,transformers
,torch
,moviepy
,soundfile
, andhuggingface_hub
โ all the tools a growing video AI needs! - Currently,
transcribe_video_audio
bravely takes video bytes, extracts the audio, and transcribes it using our chosen Whisper champion (defaulting to the mightyopenai/whisper-large-v3
). - Coming Soon: New functions (or an upgraded mega-function!) to perform the Triple Threat Video Analysisโข (captioning, action recognition, object detection).
- Might need your Hugging Face token (via a Modal secret like
HF_TOKEN_SECRET
) if we dabble in gated models or just want to be polite to the Hugging Face Hub. ๐ค
๐ก API Endpoint & MCP Integration Plan - Talking to the LLMs!
Our Gradio app (app.py
) isn't just a pretty face; it's a communication hub! It'll expose a robust API endpoint for MCP clients.
- How We'll Build It: We're thinking a Gradio Interface function with a catchy
api_name
, or maybe a sleek FastAPI route snuggled into our Gradio app. Options, options! ๐ค - The Language of LLMs (Output Format): Our API will speak fluent JSON. It'll provide a beautifully structured package containing the transcription, and soon, all the video analysis gold (captions, actions, objects). Imagine something like:
(Okay, the exact structure is TBD, but you get the idea โ rich and LLM-ready!){ "transcription": "The quick brown fox jumps over the lazy dog...", "video_caption": "A montage of adorable animal antics.", "actions_detected": ["jumping", "sleeping", "being generally cute"], "objects_of_interest": [ {"label": "fox", "confidence": 0.95, "bounding_box": [10, 20, 50, 60]}, {"label": "dog", "confidence": 0.98, "bounding_box": [100, 120, 80, 70]} ] }
- The Grand Demo Video ๐ฌ: As per the sacred hackathon scrolls, we will create a video. This video will be our magnum opus, showcasing an MCP client (Claude Desktop? Cursor? A plucky custom Gradio client?) calling our API, fetching this glorious context, and using it to achieve new heights of LLM wisdom. We're still picking our co-star for this demo, so stay tuned! ๐
๐ Future Work & Next Steps - To Infinity and Beyond!
We're not stopping at 'good enough'; we're aiming for 'mind-blowingly awesome'! ๐คฏ
- Unleash the Triple Threat Video Analysisโข!
- Scour the Hugging Face Hub and beyond for the best models for video captioning, action recognition, and object detection/tracking. ๐ต๏ธโโ๏ธ
- Integrate these champions into new or existing Modal functions. We want parallel processing for maximum speed and insight!
- Upgrade our Gradio app and API to proudly present these new layers of video understanding.
- Fortify the API Endpoint: Make it rock-solid. Bulletproof error handling. Consistent, crystal-clear output. An API so good, MCP clients will write songs about it. ๐ถ
- Choose Our Champion (MCP Client) & Film the Epic Demo: Finalize which MCP client will star alongside our server in the demo video. Then, lights, camera, action! ๐ฅ We need a compelling showcase for the hackathon judges.
- The Never-Ending Quest for Perfection: Continuously refine transcription accuracy. Squeeze out every last drop of processing speed. Optimize costs. The journey never truly ends! ๐ค๏ธ
๐คฏ Troubleshooting - Don't Panic! (Usually...)
When the digital gremlins strike, here are a few things to check:
ModuleNotFoundError: No module named 'moviepy.editor'
(lurking in Modal logs): Ah, the classic!moviepy
might be playing hide-and-seek in your Modal image. Double-checkpip_install
and anyrun_commands
inmodal_whisper_app.py
. A redeploy might be in order.yt-dlp
throwing a tantrum orffmpeg
acting shy: Ensureffmpeg
is installed both locally (forapp.py
's antics) AND within the Modal image (apt_install("ffmpeg")
). It needs to be everywhere, like a helpful ninja. ๐ฅท- Modal Authentication Woes or Deployment Drama:
Did you run
modal setup
? Is your Modal token still feeling loved and active? If you're deploying to Hugging Face Spaces, remember Modal tokens might need special treatment as environment variables/secrets. Check the scrolls (aka Modal docs)! ๐
Let's build something amazing! ๐๐ป๐