UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Abstract
UniVA is an open-source multi-agent framework that integrates video understanding, segmentation, editing, and generation into cohesive workflows using a Plan-and-Act architecture and hierarchical memory.
While specialized AI models excel at isolated video tasks such as generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents carry these out through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
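To make the Plan-and-Act architecture concrete, the following is a minimal, self-contained sketch of a planner/executor loop over a three-tier memory. Every name in it (HierarchicalMemory, Planner, Executor, run_workflow, and the stubbed tool steps) is an illustrative assumption rather than UniVA's actual API; the released system prompts LLMs for planning and routes each step to MCP-based tool servers.

```python
# Minimal sketch of a Plan-and-Act loop with hierarchical memory.
# All class, method, and step names are illustrative, not UniVA's actual API.
from dataclasses import dataclass, field


@dataclass
class HierarchicalMemory:
    """Three memory tiers: global knowledge, task context, user preferences."""
    global_knowledge: dict = field(default_factory=dict)
    task_context: list = field(default_factory=list)
    user_preferences: dict = field(default_factory=dict)

    def log(self, step: str, result: str) -> None:
        # Task-level memory keeps a full trace of every executed step.
        self.task_context.append({"step": step, "result": result})


class Planner:
    """Interprets the user request and decomposes it into ordered steps."""

    def plan(self, request: str, memory: HierarchicalMemory) -> list:
        # A real planner would prompt an LLM with the request and memory;
        # here we return a fixed decomposition for illustration.
        return ["analyze_source_video", "segment_target_object",
                "edit_scene", "generate_final_cut"]


class Executor:
    """Dispatches each planned step to a registered tool (stubbed below)."""

    def __init__(self, tools: dict):
        self.tools = tools  # step name -> callable tool-server stub

    def act(self, step: str, memory: HierarchicalMemory) -> str:
        result = self.tools[step](memory)
        memory.log(step, result)  # keep the workflow fully traceable
        return result


def run_workflow(request: str, planner: Planner, executor: Executor,
                 memory: HierarchicalMemory) -> list:
    """Plan once, then act step by step while accumulating task context."""
    return [executor.act(step, memory) for step in planner.plan(request, memory)]


if __name__ == "__main__":
    steps = ["analyze_source_video", "segment_target_object",
             "edit_scene", "generate_final_cut"]
    tools = {name: (lambda mem, n=name: f"{n}: done") for name in steps}
    memory = HierarchicalMemory(user_preferences={"style": "cinematic"})
    print(run_workflow("Replace the car with a horse, then extend the shot",
                       Planner(), Executor(tools), memory))
    print(memory.task_context)  # full trace of the executed workflow
```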
Community
UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
Paper | Website & Demo | Code
Key Highlights
- End-to-End Unified Video Generalist: a one-stop, omni-capable video creation framework that bridges understanding, reasoning, editing, tracking, and generation in one foundation.
- Agentic Video Creation: Plan-and-Act dual agents that understand, reason, and create videos interactively.
- Proactive Workflow: UniVA iterates with you like a director, planning shots, refining scenes, and suggesting better stories.
- Deep Memory & Intent Understanding: keeps global and user-level memory for consistent style, lore, and preferences.
- Industrial-Grade Production: an any-conditioned pipeline with cinematic quality, long-form consistency, and cross-modal editing.
- MCP-Based Modular, Extensible Ecosystem: MCP-native, plug-and-play tool servers allow open-ended expansion with new tools and models (see the server sketch after this list).
- UniVA-Bench: a benchmark for agentic video intelligence across multi-step compositional tasks.
- Open: UniVA is fully open-source, omni-capable, and extensible.
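As a companion to the MCP bullet above, here is a hedged sketch of how one extra plug-and-play tool server might be exposed, assuming the official MCP Python SDK's FastMCP interface (installable via `pip install "mcp[cli]"`). The server name `video-segmentation` and the `segment_object` tool are hypothetical and are not part of UniVA's released tool servers.

```python
# Sketch of a plug-and-play MCP tool server (hypothetical; not UniVA's actual server).
from mcp.server.fastmcp import FastMCP

server = FastMCP("video-segmentation")  # illustrative server name


@server.tool()
def segment_object(video_path: str, prompt: str) -> str:
    """Segment the object described by `prompt` in the video at `video_path`."""
    # A real implementation would call a video segmentation model here;
    # this stub just returns a placeholder mask path.
    return f"{video_path}.{prompt.replace(' ', '_')}.mask.mp4"


if __name__ == "__main__":
    # Any MCP-compatible planner/executor can discover and call this tool
    # over stdio without changes on the agent side.
    server.run(transport="stdio")
```

Because tool discovery and invocation go through the MCP protocol itself, adding a server like this requires no changes on the planner/executor side, which is what the plug-and-play claim amounts to in practice.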
This is an automated message from the Librarian Bot. I found the following papers similar to this paper.
The following papers were recommended by the Semantic Scholar API
- UniVideo: Unified Understanding, Generation, and Editing for Videos (2025)
- Agent-Omni: Test-Time Multimodal Reasoning via Model Coordination for Understanding Anything (2025)
- GUI-ReWalk: Massive Data Generation for GUI Agent via Stochastic Exploration and Intent-Aware Reasoning (2025)
- VISTA: A Test-Time Self-Improving Video Generation Agent (2025)
- Hollywood Town: Long-Video Generation via Cross-Modal Multi-Agent Orchestration (2025)
- ToolScope: An Agentic Framework for Vision-Guided and Long-Horizon Tool Use (2025)
- ColorAgent: Building A Robust, Personalized, and Interactive OS Agent (2025)