Motion, speech, and sound effects are fundamental elements of human-centric videos, yet their heterogeneous temporal characteristics make joint generation highly challenging. Existing audio-video generation models often fail to maintain consistent alignment across these modalities, leading to noticeable mismatches between motion, speech, and environmental sounds. We present Unison, a unified framework that explicitly promotes coherence across the motion, speech, and sound modalities. Within the audio stream, Unison employs a semantic-guided harmonization strategy that decouples the generation of speech and sound-effect components. Leveraging bidirectional audio cross-attention and semantic-conditioned gating for semantic-driven adaptive recomposition, this approach effectively mitigates speech dominance and enhances acoustic clarity. For audio-motion synchronization, we propose a bidirectional cross-modal forcing strategy where the cleaner modality guides the noisier one through decoupled denoising schedules, reinforced by a progressive stabilization strategy. Extensive experiments demonstrate that Unison achieves state-of-the-art performance in both audio perceptual quality and cross-modal synchronization, highlighting the importance of explicit multimodal harmonization in human-centric video generation.
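As an illustration only, the two mechanisms named in the abstract can be sketched in a few lines of Python. The abstract does not specify the gate's functional form, the noise schedule, or any function names, so everything below is an assumption: `semantic_gate`, `recompose`, `sample_decoupled_timesteps`, and `guidance_direction` are hypothetical helpers, the gate is assumed to be a scalar sigmoid, and "cleaner" is assumed to mean "lower diffusion timestep". This is not the Unison implementation.

```python
import math
import random


def semantic_gate(semantic_score, weight=1.0, bias=0.0):
    """Map a scalar semantic score to a gate in (0, 1) with a sigmoid.

    Assumption: the real gate is likely a learned, per-feature function;
    a scalar sigmoid is used here only to show the idea.
    """
    return 1.0 / (1.0 + math.exp(-(weight * semantic_score + bias)))


def recompose(speech, sfx, gate):
    """Blend separately generated speech and sound-effect features.

    A gate near 1 favors speech, a gate near 0 favors sound effects,
    so neither component is forced to dominate the mix.
    """
    return [gate * s + (1.0 - gate) * e for s, e in zip(speech, sfx)]


def sample_decoupled_timesteps(rng, num_steps=1000):
    """Draw independent noise levels for the audio and motion streams."""
    t_audio = rng.randrange(num_steps)
    t_motion = rng.randrange(num_steps)
    return t_audio, t_motion


def guidance_direction(t_audio, t_motion):
    """Pick the guiding stream: the one with the lower (cleaner) timestep."""
    if t_audio < t_motion:
        return "audio->motion"
    if t_motion < t_audio:
        return "motion->audio"
    return "symmetric"


if __name__ == "__main__":
    rng = random.Random(0)
    t_a, t_m = sample_decoupled_timesteps(rng)
    print(t_a, t_m, guidance_direction(t_a, t_m))
    print(recompose([1.0, 0.0], [0.0, 1.0], semantic_gate(0.0)))
```

Sampling the two timesteps independently is what makes the guidance bidirectional in this sketch: across training steps, either stream can end up as the cleaner one and therefore act as the guide.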