Recent text-to-audio generation techniques have the potential to let novice users freely generate music audio. Even without musical knowledge, such as familiarity with chord progressions and instruments, users can try various text prompts to generate audio. However, compared to the image domain, it is difficult to gain a clear understanding of the space of possible music audio, because users cannot listen to multiple generated variations simultaneously. We therefore help users explore not only text prompts but also audio priors that constrain the text-to-audio music generation process. This dual-sided exploration enables users to discern the impact of different text prompts and audio priors on the generation results by iteratively comparing them. Our interface, IteraTTA, is specifically designed to aid users in refining text prompts and selecting favorable audio priors from the generated audio. With it, users can progressively reach their loosely specified goals while understanding and exploring the space of possible results. Our implementation and discussion highlight design considerations specific to text-to-audio models and show how interaction techniques can contribute to their effectiveness.