Music shapes the tone of videos, yet creators often struggle to find soundtracks that match a video's mood and narrative. Recent text-to-music models let creators generate music from text prompts, but our formative study (N=8) shows that creators have difficulty constructing diverse prompts, quickly reviewing and comparing tracks, and understanding each track's impact on the video. We present VidTune, a system that supports soundtrack creation by generating diverse music options from a creator's prompt and producing contextual thumbnails for rapid review. VidTune extracts representative video subjects to ground thumbnails in context, maps each track's valence and energy onto visual cues like color and brightness, and depicts prominent genres and instruments. Creators can refine tracks through natural language edits, which VidTune expands into new generations. In a controlled user study (N=12) and an exploratory case study (N=6), participants found VidTune helpful for efficiently reviewing and comparing music options and described the process as playful and enriching.
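To make the valence and energy mapping concrete, the following is a minimal illustrative sketch and not VidTune's actual implementation. It assumes both values are normalized to [0, 1], that valence drives hue (cool blue for low, warm yellow for high), and that energy drives brightness. The function name `mood_to_rgb`, the hue range, the fixed saturation, and the brightness floor are all hypothetical choices made for this example.

```python
import colorsys
from typing import Tuple


def mood_to_rgb(valence: float, energy: float) -> Tuple[int, int, int]:
    """Map valence and energy (each assumed in [0, 1]) to an 8-bit RGB color.

    Hypothetical mapping for illustration only; the abstract does not
    specify the exact function VidTune uses.
    """
    # Clamp inputs so out-of-range estimates still yield a valid color.
    v = min(max(valence, 0.0), 1.0)
    e = min(max(energy, 0.0), 1.0)

    # Assumption: low valence -> blue (240 degrees), high valence -> yellow (60 degrees).
    hue = (240.0 - 180.0 * v) / 360.0

    # Assumption: energy controls brightness, with a floor of 0.3 so that
    # low-energy tracks do not produce a nearly black thumbnail tint.
    brightness = 0.3 + 0.7 * e

    # Assumption: fixed saturation keeps the tint visible but not garish.
    saturation = 0.8

    r, g, b = colorsys.hsv_to_rgb(hue, saturation, brightness)
    return (round(r * 255), round(g * 255), round(b * 255))


if __name__ == "__main__":
    # A sad, calm track versus a happy, energetic one.
    print(mood_to_rgb(0.1, 0.2))
    print(mood_to_rgb(0.9, 0.9))
```

Under these assumptions, a tint like this could be blended over a thumbnail so that tracks with different moods are distinguishable at a glance, without listening to each one.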