As the volume of video content on the internet grows rapidly, finding a suitable soundtrack remains a significant challenge. This thesis presents EMSYNC (EMotion and SYNChronization), a fast, free, and automatic solution that generates music tailored to the input video, enabling content creators to enhance their productions without composing or licensing music. Our model creates music that is emotionally and rhythmically synchronized with the video. A core component of EMSYNC is a novel video emotion classifier. By leveraging pretrained deep neural networks for feature extraction and keeping them frozen while training only the fusion layers, we reduce computational cost while improving accuracy. We demonstrate the generalization ability of our method by achieving state-of-the-art results on the Ekman-6 and MovieNet datasets. Another key contribution is a large-scale, emotion-labeled MIDI dataset for affective music generation. We then present an emotion-based MIDI generator, the first to condition on continuous emotional values rather than discrete categories, enabling nuanced music generation aligned with complex emotional content. To enhance temporal synchronization, we introduce a novel temporal boundary conditioning method, called "boundary offset encodings," which aligns musical chords with scene changes. By combining video emotion classification, emotion-based music generation, and temporal boundary conditioning, EMSYNC operates as a fully automatic video-based music generator. User studies show that it consistently outperforms existing methods in music richness, emotional alignment, temporal synchronization, and overall preference, setting a new state of the art in video-based music generation.
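To make the boundary-offset idea concrete, the following is a minimal illustrative sketch, not the implementation described in the thesis: the function name, offset range, and bin count are assumptions. Each music-token timestamp is mapped to a discretized, signed time offset to the nearest scene cut, which a generator could consume as an additional conditioning token so that chord changes can be steered toward scene boundaries.

```python
import numpy as np

def boundary_offset_encodings(token_times, scene_cuts, max_offset=8.0, n_bins=33):
    """Hypothetical sketch: encode, for each music-token timestamp, the signed
    time offset (seconds) to the nearest scene boundary as a discrete bin index.
    The thesis's actual tokenization, offset range, and bin layout may differ."""
    cuts = np.asarray(scene_cuts, dtype=float)
    ids = []
    for t in np.asarray(token_times, dtype=float):
        # Signed distance to the nearest scene cut; positive = cut lies ahead.
        offset = cuts[np.argmin(np.abs(cuts - t))] - t
        offset = float(np.clip(offset, -max_offset, max_offset))
        # Map [-max_offset, max_offset] linearly onto bins {0, ..., n_bins - 1};
        # the center bin marks tokens that coincide with a boundary.
        ids.append(int(round((offset + max_offset) / (2 * max_offset) * (n_bins - 1))))
    return ids

# Tokens near the cuts at t = 4.0 s and t = 9.5 s receive near-center bins,
# signaling that a chord change should land there.
print(boundary_offset_encodings([0.0, 3.8, 4.0, 9.6], [4.0, 9.5]))
```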