Despite recent breakthroughs, audio foundation models struggle to process complex multi-source acoustic scenes. We refer to this challenging domain as audio stories, which can contain multiple speakers and background/foreground sound effects. Compared to traditional audio processing tasks, audio stories introduce new layers of semantic, temporal, and physical complexity. To address this challenge, we propose AudioChat, a framework for developing audio foundation models that can generate, edit, and understand audio stories. AudioChat introduces a new paradigm in which LLM-based tool-calling agents simulate interactions between users and the system, and these simulated dialogues are used as training data. We also introduce a novel Audio Transfusion Forcing objective to train the AudioChat model, allowing it to simultaneously decompose high-level instructions via structured chain-of-thought reasoning and perform interactive multi-turn audio understanding/generation. To evaluate generation and editing performance, we develop three new metrics that directly measure task performance instead of relying on distribution-based scoring. We encourage readers to visit our demo to better understand the capabilities of AudioChat: https://wanchichen.github.io/audiochat/.