We introduce a film score generation framework that harmonizes visual pixels and musical melodies using a latent diffusion model. The framework takes film clips as input and generates music aligned with the film's overall theme, while also allowing outputs to be tailored to a specific composition style. Our model produces music directly from video through a streamlined and efficient tuning mechanism built on ControlNet, and it integrates a film encoder that captures a film's semantic depth, emotional impact, and aesthetic appeal. We further introduce a simple yet effective evaluation metric for assessing the originality and recognizability of music in film scores. To address the lack of suitable training data for film scoring, we curate a comprehensive dataset of film videos paired with legendary original scores, injecting domain-specific knowledge into our data-driven generation model. Our model outperforms existing methods in film score creation and can generate music that reflects the stylistic guidance of a maestro, setting a new benchmark for automated film scoring and laying a solid foundation for future research in this domain. The code and generated samples are available at https://anonymous.4open.science/r/HPM.