Providing soundtracks for videos remains a costly and time-consuming challenge for multimedia content creators. We introduce EMSYNC, an automatic video-based symbolic music generator that creates music aligned with a video's emotional content and temporal boundaries. It follows a two-stage framework in which a pretrained video emotion classifier extracts emotional features and a conditional music generator produces MIDI sequences guided by both emotional and temporal cues. We introduce boundary offsets, a novel temporal conditioning mechanism that enables the model to anticipate upcoming video scene cuts and align generated musical chords with them. We also propose a mapping scheme that bridges the discrete categorical outputs of the video emotion classifier with the continuous valence-arousal inputs required by the emotion-conditioned MIDI generator, enabling seamless integration of emotion information across the two representations. Our method outperforms state-of-the-art models in both objective and subjective evaluations across different video datasets, demonstrating its effectiveness in generating music that is aligned with video both emotionally and temporally. Our demo and output samples are available at https://serkansulun.com/emsync.
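To make the two conditioning mechanisms concrete, the following is a minimal Python sketch of one way the categorical-to-dimensional emotion mapping and the boundary-offset computation could be realized. It is not the paper's actual implementation: the emotion categories, the valence-arousal coordinates in VA_COORDS, the quantization bin edges, and the function names probs_to_valence_arousal and boundary_offsets are all illustrative assumptions.

```python
import numpy as np

# Hypothetical valence-arousal coordinates for the classifier's discrete
# emotion categories (illustrative values; the paper's mapping may differ).
VA_COORDS = {
    "anger":   (-0.7,  0.8),
    "fear":    (-0.6,  0.7),
    "sadness": (-0.7, -0.5),
    "joy":     ( 0.8,  0.6),
    "calm":    ( 0.5, -0.6),
}
CATEGORIES = list(VA_COORDS)

def probs_to_valence_arousal(probs):
    """Map the classifier's softmax probabilities over discrete categories
    to one continuous valence-arousal point via a probability-weighted
    average, usable as conditioning input for the MIDI generator."""
    coords = np.array([VA_COORDS[c] for c in CATEGORIES])  # shape (K, 2)
    return probs @ coords  # shape (2,): (valence, arousal)

def boundary_offsets(event_times, cut_times, bins):
    """For each symbolic music event, compute the time remaining until the
    next video scene cut and quantize it into a discrete offset token, so
    the generator can anticipate upcoming cuts and place chords on them."""
    cut_times = np.sort(np.asarray(cut_times))
    offsets = []
    for t in event_times:
        upcoming = cut_times[cut_times >= t]
        dt = upcoming[0] - t if len(upcoming) else np.inf  # no cut ahead
        offsets.append(int(np.digitize(dt, bins)))  # one token per event
    return offsets

# Usage sketch: derive both conditioning signals for the music generator.
probs = np.array([0.05, 0.05, 0.1, 0.7, 0.1])      # e.g. mostly "joy"
va = probs_to_valence_arousal(probs)               # continuous emotion input
tokens = boundary_offsets(event_times=[0.0, 1.5, 3.9],
                          cut_times=[4.0, 9.5],
                          bins=[0.5, 1.0, 2.0, 4.0])
```

Under these assumptions, the weighted average lets soft classifier outputs blend neighboring emotions rather than snapping to a single category, and the quantized offsets give the sequence model an explicit countdown to each scene cut.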