---
title: SlideDeck AI
emoji: 🤖
colorFrom: yellow
colorTo: blue
sdk: gradio
sdk_version: 5.33.1
app_file: app.py
pinned: true
license: apache-2.0
tags:
- Agents-MCP-Hackathon
- mcp-server-track
- agent-demo-track
short_description: Turn any document into an interactive presentation
---
# 🤖 SlideDeck AI
An autonomous AI agent that turns your raw documents into stunning presentations.
## 🚀 Watch the Demo!
- 🎥 Watch the Video Walkthrough on YouTube: [Link to the YouTube Video]
This project is a submission for the Hugging Face & Gradio Agents & MCP Hackathon. It demonstrates a powerful, multi-tool agentic pipeline that handles everything from creative direction to asset generation using open-source LLMs.
## ✨ Key Features
SlideDeck AI acts as your personal creative director, designer, and narrator. It's not just a summarizer; it's a content creator.
- 🧠 Intelligent Document Analysis: Upload multiple documents (`PDF`, `DOCX`, `TXT`), and the agent uses LlamaParse to understand and synthesize the core information.
- 🎨 AI Creative Direction: The agent analyzes your topic and generates a complete visual theme, including a color palette, font pairings, and a compelling title for your presentation.
- 🖼️ Contextual AI Image Generation: For high-impact slides, the agent writes its own detailed prompts and uses them to generate beautiful, custom images that match the slide's content.
- 🎤 AI-Narrated Speaker Notes: Every slide comes with a full set of speaker notes, which are then converted into high-quality audio narration for you to listen to.
- 📊 Dynamic CSS Visualizations: When an image isn't needed, the agent builds data-driven visualizations (like charts and diagrams) using pure HTML and CSS.
- 📄 One-Click Export: Download your final, beautifully designed presentation as a self-contained HTML file or a shareable PDF.
## ⚙️ How It Works: The Agentic Pipeline
SlideDeck AI uses a chain of specialized tools and AI models to achieve its results. Each step passes its output to the next in a seamless agentic workflow:
- Parse Tool (`LlamaParse`): The agent takes your uploaded files and uses LlamaParse to convert them into clean, structured Markdown text.
- Creative Plan Tool (`Sambanova`): This Markdown text is fed to a powerful LLM on Sambanova. The LLM acts as a Creative Director, outputting a detailed JSON "master plan" that defines the entire presentation: slide titles, key points, speaker notes, and even prompts for image generation.
- Image Asset Tool (`Nebius`): The agent extracts the image prompts from the JSON plan and sends them to the Flux-Schnell model via Nebius to generate visual assets.
- Audio Asset Tool (`Modal`): The speaker notes from the JSON plan are sent to a custom Text-to-Speech model (https://huggingface.co/hexgrad/Kokoro-82M) deployed on Modal, which returns audio files for each slide.
- HTML Builder Tool (`Nebius`): Finally, the agent takes the JSON plan (now updated with image and audio URLs) and feeds it to another powerful LLM on Nebius. This model writes the complete, final HTML and CSS for the presentation.
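To make the hand-offs between steps concrete, here is a minimal Python sketch of how such a chain could be orchestrated. The helper functions (`parse_documents`, `create_plan`, `generate_image`, `synthesize_speech`, `build_html`) and the plan's dictionary keys (`slides`, `image_prompt`, `speaker_notes`) are hypothetical placeholders for illustration, not the actual names used in `app.py`.

```python
def build_presentation(file_paths: list[str]) -> str:
    """Run the five tools in sequence and return the final HTML string.

    All helper functions below are illustrative placeholders.
    """
    # 1. Parse Tool: convert the uploaded files into clean Markdown.
    markdown_text = parse_documents(file_paths)            # LlamaParse

    # 2. Creative Plan Tool: ask an LLM for the JSON "master plan".
    plan = create_plan(markdown_text)                      # LLM on Sambanova

    for slide in plan["slides"]:
        # 3. Image Asset Tool: render any image prompts the plan requested.
        if slide.get("image_prompt"):
            slide["image_url"] = generate_image(slide["image_prompt"])   # Flux-Schnell via Nebius
        # 4. Audio Asset Tool: narrate the speaker notes for this slide.
        slide["audio_url"] = synthesize_speech(slide["speaker_notes"])   # Kokoro-82M on Modal

    # 5. HTML Builder Tool: hand the enriched plan to a second LLM that
    #    writes the final, self-contained HTML/CSS deck.
    return build_html(plan)                                # LLM on Nebius
```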
## ⚠️ Known Issues & Future Improvements
This project was built within the tight timeframe of a hackathon. Here are a few known limitations and how I plan to address them in the future:
### 1. PDF Export Quality
- The Issue: The downloaded PDF file may not perfectly match the beautiful layout seen in the "Final Presentation" tab. Complex CSS features such as advanced grid layouts or custom fonts can sometimes render incorrectly.
- The Cause: The PDF is generated using the `weasyprint` library. While powerful, it's not a full web browser engine and can struggle with the very modern and complex CSS that the AI agent generates.
- The Workaround: For a pixel-perfect view, please use the "Final Presentation" tab directly in the UI. For sharing, you can right-click -> "Save As" on that page to get a self-contained HTML file that will look perfect in any modern browser.
- Roadmap: The gold-standard solution is to integrate a headless browser like Playwright. This would allow the app to take a perfect "screenshot" of the rendered HTML page and save it as a high-fidelity PDF.
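As a rough illustration of that roadmap item, the sketch below uses Playwright's Python API to load the generated deck in headless Chromium and print it to PDF. The function and file names are examples only, and Chromium would need to be installed first via `playwright install chromium`.

```python
from playwright.sync_api import sync_playwright

def html_to_pdf(html_path: str, pdf_path: str) -> None:
    """Render a self-contained HTML deck in headless Chromium and save it as a PDF."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        # Load the deck from disk (use an absolute path) and wait for fonts/CSS to settle.
        page.goto(f"file://{html_path}", wait_until="networkidle")
        # Chromium's print-to-PDF preserves modern CSS far better than weasyprint.
        page.pdf(path=pdf_path, format="A4", print_background=True)
        browser.close()

# Example: html_to_pdf("/tmp/presentation.html", "/tmp/presentation.pdf")
```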
### 2. Presentation Generation Speed
- The Issue: The final step, where the agent builds the HTML code, can take some time (4-6 minutes).
- The Cause: This step deliberately uses a large, powerful reasoning model (DeepSeek-R1-0528 via Nebius) to act as an expert front-end developer. This model's strength is its high-quality, complex code generation, which comes at the cost of higher latency. This was a conscious trade-off to prioritize the quality of the final presentation over raw speed.
- The Workaround: Be patient and watch the logs! The UI provides real-time feedback so you know the agent is hard at work "thinking" and "coding" your presentation.
- Roadmap: The ideal enhancement would be to stream the model's output. Instead of waiting for the full HTML file, the code would appear in the "Raw HTML Code" tab token-by-token, creating an amazing "live coding" effect. This would dramatically improve the perceived performance and user experience.
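Here is a hedged sketch of what that streaming could look like, assuming the Nebius model is called through an OpenAI-compatible client and the result is bound to a Gradio output component. The endpoint URL, API-key handling, model id, and prompt are placeholders, not the app's actual configuration.

```python
import os
from openai import OpenAI

# Placeholder endpoint and credentials; substitute the real Nebius values.
client = OpenAI(
    base_url="https://example-openai-compatible-endpoint/v1",
    api_key=os.environ["NEBIUS_API_KEY"],
)

def stream_html(plan_json: str):
    """Generator that yields the HTML built so far, token by token."""
    response = client.chat.completions.create(
        model="deepseek-ai/DeepSeek-R1-0528",  # illustrative model id
        messages=[{"role": "user", "content": f"Write the deck HTML for: {plan_json}"}],
        stream=True,
    )
    html_so_far = ""
    for chunk in response:
        html_so_far += chunk.choices[0].delta.content or ""
        yield html_so_far  # each yield triggers a UI update
```

Because Gradio treats generator functions as streaming handlers, binding `stream_html` to the "Raw HTML Code" output would be enough for the tab to update on every `yield`.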
### 3. Audio Generation Latency
- The Issue: Generating the audio narration for each slide can feel slow, with each individual audio file taking around 15 seconds to create.
- The Cause: The audio is generated by a high-quality Text-to-Speech (TTS) model deployed on a CPU instance on Modal. For cost-efficiency and broad accessibility during the hackathon, this CPU-based approach was chosen. While reliable, CPU-based inference for speech synthesis is significantly slower than its GPU-accelerated counterpart.
- The Workaround: The UI is designed to be fully asynchronous. You don't have to wait for all audio to finish before interacting with the rest of the generated presentation. The audio players for each slide will appear in the "Speaker Notes Audio" tab as soon as they are ready.
- Roadmap: The path to near-instant audio generation involves migrating the TTS model to a GPU-based environment. By leveraging a Hugging Face Space with a GPU upgrade (like a T4 or A10G) or a dedicated GPU endpoint on Modal, the inference time per slide could be reduced from ~15 seconds to just 1-2 seconds.
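For illustration, the Modal sketch below shows the kind of change involved: the same TTS function, but requesting a GPU in the decorator. The image dependencies and the function body are placeholders rather than the actual deployment code.

```python
import modal

# Illustrative container image; the real one would pin the TTS dependencies.
image = modal.Image.debian_slim().pip_install("kokoro", "soundfile")
app = modal.App("slidedeck-tts")

@app.function(gpu="T4", image=image)  # previously CPU-only; "A10G" is another option
def narrate(text: str) -> bytes:
    """Synthesize one slide's speaker notes and return audio bytes."""
    ...  # load Kokoro-82M once per container start and run inference here (omitted in this sketch)
```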
Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference