File size: 3,467 Bytes
b0c020d
 
bfebc17
b0c020d
 
 
 
 
 
 
613b622
bfebc17
 
1ff4e78
b0c020d
 
bfebc17
 
 
 
 
 
5dc46ee
bfebc17
 
 
 
 
 
 
 
 
 
 
 
 
2422e19
bfebc17
 
 
 
 
 
 
2422e19
 
 
bfebc17
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
b0c020d
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
---
title: ClipScript
emoji: '🎬'
colorFrom: pink
colorTo: gray
sdk: gradio
sdk_version: 5.33.1
app_file: app.py
pinned: false
license: mit
short_description: Transforms videos and audio into ready-to-publish blogs.
tags:
  - agent-demo-track
video_overview: https://youtu.be/8DUxlj79NqM 
---

# 🎬 ClipScript: Video-to-Blog Transformer

ClipScript is a powerful application that transforms any video or audio content into a polished, ready-to-publish blog post. Simply provide a YouTube URL or upload an audio file, and let our AI agent handle the rest.

### Video Overview

[Watch a video demonstrating how to use ClipScript and what it is abut here!](https://youtu.be/8DUxlj79NqM)

## Features

- **YouTube & File Uploads**: Works with YouTube links or direct audio/video file uploads.
- **AI-Powered Transcription**: Utilizes a state-of-the-art ASR model for highly accurate transcription.
- **Agentic Blog Generation**: An expert AI writing agent converts the raw transcript into a structured, engaging blog post, automatically removing conversational filler and adding SEO-friendly formatting.
- **Interactive Refinement**: Chat with the AI agent to refine the generated blog post until it's perfect.
- **Secure & Scalable**: Powered by [Modal](https://modal.com) for secure, scalable, and efficient backend processing.

## Hugging Face Agent Demo Track

This application has been submitted to the **Agent Demo Track**. It showcases an "AI agent" that acts as an expert blog writer and editor, taking a high-level goal (transforming a transcript) and executing a series of steps to achieve it.

## Core Technology

### Speech-to-Text: NVIDIA Parakeet TDT 0.6B V2

The transcription engine is powered by `nvidia/parakeet-tdt-0.6b-v2`. This model is **ranked #1 on the Hugging Face Open ASR Leaderboard**, achieving the best overall average Word Error Rate (WER) and RTFx (real-time factor) score, making it one of the fastest and most accurate ASR models available.

For a deep dive into the model's architecture and performance, check out the [official model card](https://huggingface.co/nvidia/parakeet-tdt-0.6b-v2) and the [Open ASR Leaderboard](https://huggingface.co/spaces/hf-audio/open_asr_leaderboard).


For audio longer than 30 minutes, the SST model automatically segments content into optimal chunks and processes them in parallel, enabling fast transcription of hours-long content while maintaining accuracy and context.

### Content Generation: AI Writing Agent

An AI writing agent, accessed via OpenRouter, converts the raw transcript into a polished, structured blog post, ready for publishing.

### Backend Infrastructure: Modal

The backend is built on [Modal](https://modal.com) for security, scalability, and performance.

- **Secure Sandboxed Execution**: All media processing occurs in isolated Modal environments, keeping potentially malicious files separate from the Gradio server.

- **High-Performance File System**: Modal Volumes provide fast, reliable file transfer and access for user uploads.

This architecture keeps the frontend lightweight while offloading intensive tasks to secure, scalable cloud resources.

## Architecture 

The following diagram illustrates the complete data flow, from user input in the Gradio application to the final blog post generation.

![Application Architecture Diagram](https://ibb.co/SDW7NPHg)

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference