Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-Zero, a zero-pair video-to-music generation approach that produces time-aligned music for a given video. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. Because each curve measures temporal change within its own modality, the curves provide directly comparable representations across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference, without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-Zero achieves substantial gains over paired-data baselines: 5-21% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. A large crowd-sourced subjective listening test shows consistent results. Overall, our results validate that temporal alignment through within-modality features, rather than paired cross-modal supervision, is effective for video-to-music generation. Results are available at https://genjib.github.io/v2m_zero/
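To make the central mechanism concrete, the sketch below illustrates one plausible way to compute an event curve from intra-modal similarity over pretrained encoder embeddings. This is a minimal sketch, not the paper's exact formulation: the choice of adjacent-frame cosine dissimilarity, the min-max rescaling, and the function name `event_curve` are all assumptions made for illustration.

```python
import numpy as np

def event_curve(embeddings: np.ndarray) -> np.ndarray:
    """Compute a per-step event curve from a (T, D) array of frame
    embeddings produced by a pretrained encoder within one modality
    (video or music). High values mark moments of large temporal change.

    NOTE: illustrative sketch only; the paper's exact definition may differ.
    """
    # Normalize each embedding to unit length (guard against zero norms).
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    normed = embeddings / np.maximum(norms, 1e-8)
    # Cosine similarity between consecutive frames.
    sim = np.sum(normed[:-1] * normed[1:], axis=1)
    # Dissimilarity (1 - cos) measures how much changed at each step:
    # "when and how much", not "what".
    curve = 1.0 - sim
    # Rescale to [0, 1] so curves from different modalities are comparable.
    rng = curve.max() - curve.min()
    return (curve - curve.min()) / rng if rng > 0 else np.zeros_like(curve)

# Hypothetical usage: random stand-ins for encoder features.
rng = np.random.default_rng(0)
video_emb = rng.normal(size=(64, 512))  # e.g., per-frame video features
curve = event_curve(video_emb)          # shape (63,)
```

Under this reading, the text-to-music model would be fine-tuned conditioned on curves computed from music embeddings, and at inference a curve computed the same way from video embeddings would be substituted in its place, which is what lets the approach avoid paired video-music data.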