In daily life, we encounter a variety of sounds, both desirable and undesirable, with limited control over their presence and volume. Our work introduces "Listen, Chat, and Edit" (LCE), a novel multimodal sound mixture editor that modifies each sound source in a mixture based on user-provided text instructions. LCE distinguishes itself with a user-friendly chat interface and its unique ability to edit multiple sound sources simultaneously within a mixture, without needing to separate them. Users input open-vocabulary text prompts, which are interpreted by a large language model to create a semantic filter for editing the sound mixture. The system then decomposes the mixture into its components, applies the semantic filter, and reassembles it into the desired output. We developed a 160-hour dataset with over 100k mixtures, including speech and various audio sources, along with text prompts for diverse editing tasks like extraction, removal, and volume control. Our experiments demonstrate significant improvements in signal quality across all editing tasks and robust performance in zero-shot scenarios with varying numbers and types of sound sources.
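The decompose, filter, and reassemble steps described above can be illustrated with a toy sketch. It is not the LCE model: it assumes the sources are already available as separate waveforms and that the language model's output has been reduced to one scalar gain per source. The names `apply_semantic_filter`, `gains_from_tasks`, and `TASK_GAIN`, as well as the gain values themselves, are hypothetical choices made for illustration only.

```python
import numpy as np

# Hypothetical mapping from an editing task to a linear gain.
# The specific values are assumptions, not taken from the paper.
TASK_GAIN = {"keep": 1.0, "remove": 0.0, "volume_up": 2.0, "volume_down": 0.5}


def gains_from_tasks(tasks):
    """Turn one task label per source into a gain vector (a toy 'semantic filter')."""
    return np.array([TASK_GAIN[task] for task in tasks], dtype=float)


def apply_semantic_filter(sources, gains):
    """Scale each source by its gain and sum the results into one waveform.

    sources: array of shape (n_sources, n_samples)
    gains:   array of shape (n_sources,)
    """
    sources = np.asarray(sources, dtype=float)
    gains = np.asarray(gains, dtype=float)
    if sources.ndim != 2 or gains.shape != (sources.shape[0],):
        raise ValueError("expected sources of shape (n, T) and gains of shape (n,)")
    return gains @ sources  # weighted sum over sources, shape (n_samples,)


# Two synthetic stand-ins for sound sources (one second at 8 kHz).
t = np.linspace(0.0, 1.0, 8000, endpoint=False)
speech_like = np.sin(2 * np.pi * 220 * t)
noise_like = 0.3 * np.sin(2 * np.pi * 1000 * t)
sources = np.stack([speech_like, noise_like])
mixture = sources.sum(axis=0)

# "Make the first source louder and remove the second one."
edited = apply_semantic_filter(sources, gains_from_tasks(["volume_up", "remove"]))
```

With every gain set to 1 the output equals the original mixture, so extraction, removal, and volume control all reduce to different gain vectors in this simplified view.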