Harmony: Harmonizing Audio and Video Generation through Cross-Task Synergy
Abstract
Harmony addresses audio-visual synchronization in generative AI by introducing a Cross-Task Synergy training paradigm, Global-Local Decoupled Interaction Module, and Synchronization-Enhanced CFG to improve alignment and fidelity.
The synthesis of synchronized audio-visual content is a key frontier in generative AI, yet open-source models still struggle to achieve robust audio-video alignment. Our analysis reveals that this issue is rooted in three fundamental challenges of the joint diffusion process: (1) Correspondence Drift, where concurrently evolving noisy latents impede stable learning of alignment; (2) inefficient global attention mechanisms that fail to capture fine-grained temporal cues; and (3) the intra-modal bias of conventional Classifier-Free Guidance (CFG), which enhances conditional fidelity but not cross-modal synchronization. To overcome these challenges, we introduce Harmony, a novel framework that mechanistically enforces audio-visual synchronization. We first propose a Cross-Task Synergy training paradigm that mitigates drift by leveraging strong supervisory signals from audio-driven video and video-driven audio generation tasks. We then design a Global-Local Decoupled Interaction Module for efficient and precise temporal and stylistic alignment. Finally, we present a novel Synchronization-Enhanced CFG (SyncCFG) that explicitly isolates and amplifies the alignment signal during inference. Extensive experiments demonstrate that Harmony establishes a new state-of-the-art, significantly outperforming existing methods in both generation fidelity and, critically, in achieving fine-grained audio-visual synchronization.
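The abstract does not give SyncCFG's exact formulation, but the general idea it describes — standard CFG plus an isolated, amplified cross-modal alignment term — can be sketched roughly as below. All function names, the third prediction branch, and the weighting scheme are assumptions for illustration, not the paper's actual method:

```python
def standard_cfg(eps_uncond, eps_cond, w):
    # Conventional classifier-free guidance: extrapolate from the
    # unconditional noise prediction toward the conditional one.
    return eps_uncond + w * (eps_cond - eps_uncond)

def sync_enhanced_cfg(eps_uncond, eps_cond, eps_cond_sync, w, w_sync):
    # Hypothetical sketch of a synchronization-enhanced variant:
    # eps_cond_sync is a prediction that also sees the other modality,
    # while eps_cond sees only the text condition. Their difference
    # isolates the cross-modal alignment signal, which is then
    # amplified separately from the usual conditional guidance.
    guided = eps_uncond + w * (eps_cond - eps_uncond)
    return guided + w_sync * (eps_cond_sync - eps_cond)
```

In this reading, `w` plays the role of the ordinary guidance scale, while `w_sync` controls how strongly the isolated audio-video alignment signal is boosted at each denoising step.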
Community
This is an automated message from the Librarian Bot. The following similar papers were recommended by the Semantic Scholar API:
- UniAVGen: Unified Audio and Video Generation with Asymmetric Cross-Modal Interactions (2025)
- Model-Guided Dual-Role Alignment for High-Fidelity Open-Domain Video-to-Audio Generation (2025)
- ProAV-DiT: A Projected Latent Diffusion Transformer for Efficient Synchronized Audio-Video Generation (2025)
- Training-Free Multimodal Guidance for Video to Audio Generation (2025)
- Taming Text-to-Sounding Video Generation via Advanced Modality Condition and Interaction (2025)
- Ovi: Twin Backbone Cross-Modal Fusion for Audio-Video Generation (2025)
- Taming Modality Entanglement in Continual Audio-Visual Segmentation (2025)