Music shapes the tone of videos, yet creators often struggle to find soundtracks that match their video's mood and narrative. Recent text-to-music models let creators generate music from text prompts, but our formative study (N=8) shows that creators find it difficult to construct diverse prompts, quickly review and compare tracks, and understand each track's impact on the video. We present VidTune, a system that supports soundtrack creation by generating diverse music options from a creator's prompt and producing contextual thumbnails for rapid review. VidTune extracts representative video subjects to ground thumbnails in context, maps each track's valence and energy onto visual cues such as color and brightness, and depicts prominent genres and instruments. Creators can refine tracks through natural language edits, which VidTune expands into new generations. In a controlled user study (N=12) and an exploratory case study (N=6), participants found VidTune helpful for efficiently reviewing and comparing music options and described the process as playful and enriching.