Audio is an inherently interactive modality, yet today's Large Audio Language Models (LALMs) are offline, and streaming audio models each handle only a single task such as streaming ASR or voice chatting. It is time to unify them into one online LALM: a model that, through an always-on perceive-decide-respond loop, listens to sound, environment, and instructions in real time and reacts on the fly. We formalize this regime as the Audio Interaction Model, and realize it with Audio-Interaction, a unified streaming model that retains offline task execution while adding online general audio instruction following, from dialogue to full voice chatting, deciding when to respond from the semantics of the stream. To enable this, we propose SoundFlow, a framework that instantiates the perceive-decide-respond loop end to end, from data to training to deployment, through streaming-native data construction, comprehension-aware training, and asynchronous low-latency inference for stable real-time interaction. We further construct StreamAudio-2M, a 2.6M-item streaming corpus spanning 7 fundamental abilities and 28 sub-tasks, and Proactive-Sound-Bench for evaluating proactive audio intervention. Across 8 benchmarks, Audio-Interaction preserves competitive performance on mainstream audio tasks while unlocking capabilities inaccessible to offline LALMs, including real-time ASR, streaming audio instruction following, and proactive help.
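As a rough illustration of the always-on perceive-decide-respond loop described above, the following minimal Python sketch processes a stream chunk by chunk and emits a response only when a decision function fires. It is not the Audio-Interaction or SoundFlow implementation: the names `interaction_loop`, `Event`, `decide`, and `respond` are hypothetical, text strings stand in for audio chunks, and clearing the context after each response is an assumption made purely for simplicity.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, List


@dataclass
class Event:
    """One response emitted by the loop (hypothetical type for this sketch)."""
    chunk_index: int  # index of the chunk that triggered the response
    text: str         # the response payload


def interaction_loop(
    chunks: Iterable[str],
    decide: Callable[[List[str]], bool],
    respond: Callable[[List[str]], str],
) -> Iterator[Event]:
    """Toy perceive-decide-respond loop over a stream of chunks.

    `decide` and `respond` are placeholders for model components; here they
    are ordinary Python callables supplied by the caller.
    """
    context: List[str] = []
    for index, chunk in enumerate(chunks):
        context.append(chunk)                      # perceive: ingest the new chunk
        if decide(context):                        # decide: respond now, or keep listening?
            yield Event(index, respond(context))   # respond: emit output on the fly
            context.clear()                        # assumption: start a fresh turn afterwards


# Usage: respond whenever the latest chunk ends a question.
stream = ["hum", "what time", "is it?", "noise", "help?"]
events = list(
    interaction_loop(
        stream,
        decide=lambda ctx: ctx[-1].endswith("?"),
        respond=lambda ctx: " ".join(ctx),
    )
)
```

In this toy run the loop stays silent on the first two chunks and responds only at chunks 2 and 4, which mirrors the idea of deciding when to respond from the content of the stream rather than from a fixed turn boundary.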