Video-conditioned audio generation, including Video-to-Sound (V2S) and Visual Text-to-Speech (VisualTTS), has traditionally been treated as two distinct tasks, leaving the potential of a unified generative framework largely underexplored. In this paper, we bridge this gap with VSSFlow, a unified flow-matching framework that solves both problems. To handle multiple input signals effectively within a Diffusion Transformer (DiT) architecture, we propose a disentangled condition aggregation mechanism that leverages the distinct intrinsic properties of attention layers: cross-attention for semantic conditions and self-attention for temporally intensive conditions. Moreover, contrary to the prevailing belief that joint training on the two tasks degrades performance, we demonstrate that VSSFlow maintains superior performance under end-to-end joint learning. Furthermore, using a straightforward feature-level data synthesis method, we demonstrate that our framework provides a robust foundation that readily adapts to joint sound and speech generation with synthetic data. Extensive experiments on V2S, VisualTTS, and joint generation benchmarks show that VSSFlow effectively unifies these tasks and surpasses state-of-the-art domain-specific baselines, underscoring the potential of unified generative models. Project page: https://vasflow1.github.io/vasflow/
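The split between the two attention types can be illustrated with a toy block. This is only a minimal sketch under stated assumptions, not the VSSFlow implementation: the function names (`attend`, `block`), the single-head attention without learned projections, and in particular the choice to feed temporal conditions into self-attention by concatenating them with the latent along the sequence axis are all hypothetical. The abstract states only that semantic conditions are routed through cross-attention and temporally intensive conditions through self-attention.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def attend(queries, keys, values):
    # Single-head scaled dot-product attention over lists of vectors.
    # No learned projections: this is an illustrative simplification.
    d = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        out.append([sum(wi * v[j] for wi, v in zip(w, values))
                    for j in range(len(values[0]))])
    return out

def add(xs, ys):
    # Element-wise residual addition of two token sequences.
    return [[a + b for a, b in zip(x, y)] for x, y in zip(xs, ys)]

def block(latent, temporal_cond, semantic_cond):
    # ASSUMPTION: temporal conditions join the latent tokens along the
    # sequence axis, so self-attention mixes them with the audio latent.
    seq = latent + temporal_cond
    h = add(seq, attend(seq, seq, seq))
    # Semantic conditions act only as keys/values in cross-attention.
    h = add(h, attend(h, semantic_cond, semantic_cond))
    # Keep only the positions that correspond to the audio latent.
    return h[:len(latent)]

# Toy inputs: 4 latent tokens, 4 temporal tokens, 2 semantic tokens, dim 3.
latent = [[0.1 * i, 0.2, -0.1 * i] for i in range(4)]
temporal = [[1.0, 0.0, 0.0] for _ in range(4)]
semantic = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]

out = block(latent, temporal, semantic)
print(len(out), len(out[0]))  # 4 3
```

In this sketch both condition streams influence the output, but through different paths: the temporal tokens share one sequence with the latent, while the semantic tokens are only ever read as context.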