Audio Interaction Model

Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.
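The perceive-decide-respond loop described above can be illustrated with a toy sketch. This is not the Audio-Interaction or SoundFlow implementation, which the abstract does not specify. It assumes text strings stand in for audio chunks, and the names `interaction_loop`, `should_respond`, and `respond` are hypothetical. The `should_respond` policy here is a trivial rule standing in for the model's semantics-based decision about when to reply.

```python
from typing import Callable, Iterable, List, Tuple


def interaction_loop(
    chunks: Iterable[str],
    should_respond: Callable[[List[str]], bool],
    respond: Callable[[List[str]], str],
) -> List[Tuple[int, str]]:
    """Run a toy perceive-decide-respond loop over a finite stream.

    Returns (chunk_index, response) pairs for each step where the
    decide policy fired. All names here are illustrative assumptions.
    """
    context: List[str] = []               # everything heard in the current turn
    responses: List[Tuple[int, str]] = []
    for i, chunk in enumerate(chunks):
        context.append(chunk)             # perceive: ingest the next chunk
        if should_respond(context):       # decide: is a reply warranted now?
            responses.append((i, respond(context)))  # respond on the fly
            context = []                  # start a fresh turn
    return responses


# Hypothetical stand-in policies: reply only when the latest chunk ends a question.
def ends_with_question(context: List[str]) -> bool:
    return context[-1].endswith("?")


def echo(context: List[str]) -> str:
    return "heard: " + " ".join(context)


stream = ["what", "time", "is it?", "(door slams)", "thanks"]
print(interaction_loop(stream, ends_with_question, echo))
```

The loop stays silent on chunks that do not call for a reply, which is the behavior that separates this regime from an offline model that answers only once the full input has arrived. A real system would run perception and response generation asynchronously rather than in a single blocking loop.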

