Training diffusion models for audiovisual sequences allows for a range of generation tasks by learning conditional distributions of various input-output combinations of the two modalities. However, this strategy often requires training a separate model for each task, which is expensive. Here, we propose a novel training approach to effectively learn arbitrary conditional distributions in the audiovisual space. Our key contribution lies in how we parameterize the diffusion timestep in the forward diffusion process. Instead of the standard fixed diffusion timestep, we propose applying variable diffusion timesteps across the temporal dimension and across the modalities of the inputs. This formulation offers the flexibility to introduce variable noise levels for different portions of the input, hence the term mixture of noise levels. We propose a transformer-based audiovisual latent diffusion model and show that it can be trained in a task-agnostic fashion using our approach to enable a variety of audiovisual generation tasks at inference time. Experiments demonstrate the versatility of our method in tackling cross-modal and multimodal interpolation tasks in the audiovisual space. Notably, our proposed approach surpasses baselines in generating temporally and perceptually consistent samples conditioned on the input. Project page: avdit2024.github.io
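To make the idea of variable diffusion timesteps concrete, here is a minimal NumPy sketch of a forward diffusion step in which the timestep is a matrix indexed by modality and time segment rather than a single scalar. It is an illustration under stated assumptions, not the authors' implementation: the linear beta schedule, the four sampling modes (`fixed`, `per_modality`, `per_segment`, `per_both`), the shared latent dimension for audio and video, and all function names are hypothetical choices made for this example.

```python
import numpy as np


def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention factors for an assumed linear beta schedule."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def sample_timesteps(rng, num_modalities, num_segments, num_steps, mode):
    """Return an integer timestep matrix of shape (num_modalities, num_segments).

    The mode names are hypothetical labels for this sketch:
      fixed        : one timestep shared by every modality and segment
      per_modality : one timestep per modality, shared across time
      per_segment  : one timestep per time segment, shared across modalities
      per_both     : an independent timestep for every (modality, segment)
    """
    shapes = {
        "fixed": (1, 1),
        "per_modality": (num_modalities, 1),
        "per_segment": (1, num_segments),
        "per_both": (num_modalities, num_segments),
    }
    if mode not in shapes:
        raise ValueError(f"unknown mode: {mode}")
    t = rng.integers(0, num_steps, size=shapes[mode])
    # Broadcast the sampled values to the full (modality, segment) grid.
    return np.broadcast_to(t, (num_modalities, num_segments)).copy()


def add_noise(rng, z0, t, alpha_bar):
    """Noise latents z0 of shape (M, N, D) with per-element timesteps t of shape (M, N)."""
    a = alpha_bar[t][..., None]  # shape (M, N, 1), broadcast over the latent dim
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(a) * z0 + np.sqrt(1.0 - a) * eps
    return zt, eps


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    alpha_bar = make_alpha_bar()
    # Assumption: 2 modalities (audio, video), 4 time segments, 8-dim latents each.
    z0 = rng.standard_normal((2, 4, 8))
    for mode in ("fixed", "per_modality", "per_segment", "per_both"):
        t = sample_timesteps(rng, 2, 4, len(alpha_bar), mode)
        zt, _ = add_noise(rng, z0, t, alpha_bar)
        print(mode, t.tolist(), zt.shape)
```

In this sketch, the `fixed` mode corresponds to the standard setup with a single noise level for the whole input, while the other modes let different portions of the input carry different noise levels. Drawing the mode at random during training would be one way to expose a single model to many such combinations, though how the modes are mixed is a design choice this example leaves open.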