This study focuses on a challenging yet promising task, Text-to-Sounding-Video (T2SV) generation, which aims to generate a video with synchronized audio from text conditions while ensuring that both modalities are aligned with the text. Despite progress in joint audio-video training, two critical challenges remain unaddressed: (1) a single, shared text caption, in which the video text is identical to the audio text, often causes modal interference that confuses the pretrained backbones; and (2) the optimal mechanism for cross-modal feature interaction remains unclear. To address these challenges, we first propose the Hierarchical Visual-Grounded Captioning (HVGC) framework, which generates pairs of disentangled captions, a video caption and an audio caption, eliminating interference at the conditioning stage. Based on HVGC, we further introduce BridgeDiT, a novel dual-tower diffusion transformer that employs a Dual Cross-Attention (DCA) mechanism acting as a robust ``bridge'' to enable a symmetric, bidirectional exchange of information, achieving both semantic and temporal synchronization. Extensive experiments on three benchmark datasets, supported by human evaluations, demonstrate that our method achieves state-of-the-art results on most metrics. Comprehensive ablation studies further validate the effectiveness of our contributions, offering key insights for future work on T2SV. All code and checkpoints will be publicly released.
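To make the abstract's description of the DCA bridge concrete, the following is a minimal, illustrative sketch of a symmetric dual cross-attention block, not the paper's released implementation: the module name `DualCrossAttention`, the token shapes, and the use of `nn.MultiheadAttention` are assumptions made purely for illustration.

```python
# Hedged sketch of symmetric, bidirectional cross-modal attention (assumed design,
# not the authors' code): video tokens attend to audio tokens and audio tokens
# attend to video tokens within the same layer, with residual connections.
import torch
import torch.nn as nn

class DualCrossAttention(nn.Module):
    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        # One cross-attention per direction: video queries over audio, audio queries over video.
        self.v_from_a = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.a_from_v = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_v = nn.LayerNorm(dim)
        self.norm_a = nn.LayerNorm(dim)

    def forward(self, v_tokens: torch.Tensor, a_tokens: torch.Tensor):
        # v_tokens: (B, Nv, D) video latent tokens; a_tokens: (B, Na, D) audio latent tokens.
        v_q, a_q = self.norm_v(v_tokens), self.norm_a(a_tokens)
        # Symmetric exchange: each tower reads keys/values from the other tower.
        v_upd, _ = self.v_from_a(v_q, a_q, a_q)
        a_upd, _ = self.a_from_v(a_q, v_q, v_q)
        # Residual connections keep each tower's own representation intact.
        return v_tokens + v_upd, a_tokens + a_upd

# Toy usage with hypothetical token counts.
if __name__ == "__main__":
    dca = DualCrossAttention(dim=512, num_heads=8)
    v = torch.randn(2, 64, 512)  # e.g., flattened video latent tokens
    a = torch.randn(2, 32, 512)  # e.g., audio latent tokens
    v_out, a_out = dca(v, a)
    print(v_out.shape, a_out.shape)  # (2, 64, 512), (2, 32, 512)
```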