V2M-Zero: Zero-Pair Time-Aligned Video-to-Music Generation

Generating music that temporally aligns with video events is challenging for existing text-to-music models, which lack fine-grained temporal control. We introduce V2M-ZERO, a video-to-music generation approach that produces time-aligned music from video with disentangled control over time synchronization and semantics (e.g., genre, mood), while requiring zero video-music pairs at training time. Our method is motivated by a key observation: temporal synchronization requires matching when and how much change occurs, not what changes. While musical and visual events differ semantically, they exhibit shared temporal structure that can be captured independently within each modality. We capture this structure through event curves computed from intra-modal similarity using pretrained music and video encoders. Because each curve measures temporal change within its own modality, the curves are comparable across modalities. This enables a simple training strategy: fine-tune a text-to-music model on music-event curves, then substitute video-event curves at inference, without cross-modal training or paired data. Across OES-Pub, MovieGenBench-Music, and AIST++, V2M-ZERO achieves state-of-the-art performance without any paired music-video data, surpassing the strongest prior baseline on each metric with 5-9% higher audio quality, 13-15% better semantic alignment, 21-52% improved temporal synchronization, and 28% higher beat alignment on dance videos. A large crowdsourced subjective listening test yields similar results. Our results validate that temporal alignment through within-modality features is not only effective for video-to-music generation but also outperforms paired cross-modal supervision. Furthermore, our approach enables independent control of timing and music style (e.g., genre, mood), making generation more controllable.
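To make the notion of an event curve concrete, the following is a minimal sketch and not the method's actual implementation. The abstract does not specify the formula, so the sketch assumes an event curve is one minus the cosine similarity between consecutive embeddings from a single modality, min-max normalized to [0, 1] and resampled to a common length. The function names `event_curve` and `resample_curve` and the toy embeddings are hypothetical; in practice the embeddings would come from the pretrained music or video encoder.

```python
import numpy as np


def event_curve(embeddings: np.ndarray, eps: float = 1e-8) -> np.ndarray:
    """Hypothetical event curve: how much consecutive embeddings differ.

    embeddings: array of shape (T, D), one row per time step, all from
    the SAME modality (intra-modal). Returns an array of shape (T - 1,).
    """
    # L2-normalize each row so that a dot product equals cosine similarity.
    norms = np.linalg.norm(embeddings, axis=1, keepdims=True)
    unit = embeddings / (norms + eps)
    # Cosine similarity between step t and step t + 1.
    similarity = np.sum(unit[:-1] * unit[1:], axis=1)
    # Assumption: "change" is defined as 1 - similarity.
    change = 1.0 - similarity
    # Min-max normalize so curves from different encoders share a scale.
    span = change.max() - change.min()
    return (change - change.min()) / (span + eps)


def resample_curve(curve: np.ndarray, num_steps: int) -> np.ndarray:
    """Linearly resample a curve to num_steps points on a shared time axis."""
    src = np.linspace(0.0, 1.0, num=len(curve))
    dst = np.linspace(0.0, 1.0, num=num_steps)
    return np.interp(dst, src, curve)


# Toy stand-in for video-encoder output: five identical frames,
# then an abrupt "scene change" to five different frames.
scene_a = np.tile(np.array([1.0, 0.0, 0.0]), (5, 1))
scene_b = np.tile(np.array([0.0, 1.0, 0.0]), (5, 1))
video_embeddings = np.vstack([scene_a, scene_b])

video_curve = event_curve(video_embeddings)
conditioning = resample_curve(video_curve, num_steps=20)
```

Under these assumptions, the curve peaks at the transition between the two scenes and stays near zero elsewhere. Because the same function would apply unchanged to music-encoder embeddings, a model fine-tuned on music-derived curves could in principle be conditioned on a video-derived curve at inference, which is the substitution the abstract describes.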
