arxiv:2511.08521

UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist

Published on Nov 11
· Submitted by Hao Fei on Nov 14
#3 Paper of the day
· UniVA-Agent UniVA

Abstract

AI-generated summary: UniVA is an open-source multi-agent framework that integrates video understanding, segmentation, editing, and generation into cohesive workflows using a Plan-and-Act architecture and hierarchical memory.

While specialized AI models excel at isolated video tasks like generation or understanding, real-world applications demand complex, iterative workflows that combine these capabilities. To bridge this gap, we introduce UniVA, an open-source, omni-capable multi-agent framework for next-generation video generalists that unifies video understanding, segmentation, editing, and generation into cohesive workflows. UniVA employs a Plan-and-Act dual-agent architecture that drives a highly automated and proactive workflow: a planner agent interprets user intentions and decomposes them into structured video-processing steps, while executor agents execute these through modular, MCP-based tool servers (for analysis, generation, editing, tracking, etc.). Through a hierarchical multi-level memory (global knowledge, task context, and user-specific preferences), UniVA sustains long-horizon reasoning, contextual continuity, and inter-agent communication, enabling interactive and self-reflective video creation with full traceability. This design enables iterative and any-conditioned video workflows (e.g., text/image/video-conditioned generation → multi-round editing → object segmentation → compositional synthesis) that were previously cumbersome to achieve with single-purpose models or monolithic video-language models. We also introduce UniVA-Bench, a benchmark suite of multi-step video tasks spanning understanding, editing, segmentation, and generation, to rigorously evaluate such agentic video systems. Both UniVA and UniVA-Bench are fully open-sourced, aiming to catalyze research on interactive, agentic, and general-purpose video intelligence for the next generation of multimodal AI systems. (https://univa.online/)
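
The planner → executor → memory flow described in the abstract can be pictured with a minimal sketch. The tool ids (video.generate, video.segment, video.edit), the step schema, and the three memory fields below are illustrative assumptions for exposition, with a stubbed planner in place of an LLM; this is not UniVA's actual implementation.

```python
# Minimal sketch of a Plan-and-Act loop over MCP-style tool servers.
# All names (tool ids, step schema, memory fields) are illustrative
# assumptions, not UniVA's real interfaces.
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Step:
    tool: str   # e.g. "video.generate", "video.segment"
    args: dict  # tool-specific arguments


@dataclass
class Memory:
    global_knowledge: dict = field(default_factory=dict)  # world/style facts
    task_context: list = field(default_factory=list)      # step-by-step trace
    user_preferences: dict = field(default_factory=dict)  # e.g. preferred style


def plan(user_request: str, memory: Memory) -> list[Step]:
    """Planner agent: decompose user intent into structured steps (stub)."""
    # A real planner would call an LLM conditioned on `memory`.
    style = memory.user_preferences.get("style", "cinematic")
    return [
        Step("video.generate", {"prompt": user_request, "style": style}),
        Step("video.segment", {"target": "main subject"}),
        Step("video.edit", {"instruction": "replace background"}),
    ]


def execute(step: Step, tools: dict[str, Callable], memory: Memory):
    """Executor agent: dispatch one step to its MCP-style tool server."""
    result = tools[step.tool](**step.args)
    memory.task_context.append({"step": step, "result": result})  # traceability
    return result


if __name__ == "__main__":
    # Toy tool servers standing in for MCP endpoints.
    tools = {
        "video.generate": lambda **kw: f"clip generated from {kw}",
        "video.segment":  lambda **kw: f"masks for {kw['target']}",
        "video.edit":     lambda **kw: f"edited clip: {kw['instruction']}",
    }
    memory = Memory(user_preferences={"style": "noir"})
    for step in plan("a rainy chase scene through a neon city", memory):
        print(execute(step, tools, memory))
```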

Community


🚀 UniVA: Universal Video Agent towards Open-Source Next-Generation Video Generalist
🔗 Paper | 🌐 Website & Demo | 💻 Code

🧩 Key Highlights

  • 🌍 End-to-End Unified Video Generalist β€” An one-stop omni video creation framework, bridging understanding, reasoning, editing, tracking and generation into one foundation.
  • πŸ€– Agentic Video Creation β€” Plan–Act dual agents that understand, reason, and create videos interactively.
  • 🎬 Proactive Workflow β€” UniVA iterates with you like a director: plans shots, refines scenes, and suggests better stories.
  • 🧠 Deep Memory & Intent Understanding β€” Keeps global + user memory for style, lore, and preferences consistency.
  • 🏭 Industrial-Grade Production β€” Any-conditioned pipeline with cinematic quality, long-form consistency, and cross-modal editing.
  • βš™ MCP-Based Modular Extensible Ecosystem β€” MCP-native, modular plug-and-play tool servers, enabling infinite expansion with new tools and models.
  • 🧾 UniVA-Bench β€” A benchmark for agentic video intelligence across multi-step compositional tasks.
  • ✨ Open β€” UniVA is fully open-source, omni-capable, and ever-extensible.


