MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. It couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces temporal representations at 12.5 Hz, the adapter projects them into the decoder space, and the decoder generates text autoregressively. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which are timestamp tokens inserted into the audio-token stream to provide explicit temporal cues. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants, each in Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising audio-understanding foundation for future voice agents.
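To make the 12.5 Hz frame rate and the time-marker idea concrete, the following is a minimal sketch and not the MOSS-Audio implementation. The marker format `<|t s|>`, the 2-second marker interval, and the function names `num_audio_frames` and `insert_time_markers` are all hypothetical choices made for illustration; the abstract does not specify how markers are formatted or how often they are inserted.

```python
import math
from typing import List

FRAME_RATE_HZ = 12.5  # encoder output rate stated in the abstract


def num_audio_frames(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of encoder frames for a clip, counting a partial last frame as one."""
    return math.ceil(duration_s * frame_rate_hz)


def insert_time_markers(
    frames: List[str],
    marker_interval_s: float = 2.0,  # assumed interval, not from the source
    frame_rate_hz: float = FRAME_RATE_HZ,
) -> List[str]:
    """Interleave a timestamp marker before each block of audio frames.

    `frames` stands in for the audio-token stream; plain strings are used as
    placeholders for the adapter outputs. The marker text is a hypothetical
    format chosen only for readability.
    """
    frames_per_marker = int(round(marker_interval_s * frame_rate_hz))
    if frames_per_marker < 1:
        raise ValueError("marker interval is shorter than one encoder frame")

    stream: List[str] = []
    for i, frame in enumerate(frames):
        if i % frames_per_marker == 0:
            start_time_s = i / frame_rate_hz  # time at which this block begins
            stream.append(f"<|{start_time_s:.1f}s|>")
        stream.append(frame)
    return stream


# A 10-second clip yields 125 frames at 12.5 Hz; with a marker every 2 seconds
# (25 frames), five markers are interleaved into the stream.
frames = [f"a{i}" for i in range(num_audio_frames(10.0))]
stream = insert_time_markers(frames)
```

Under these assumptions, one second of audio costs 12.5 positions in the decoder context, and the markers add a small, fixed overhead per interval while giving the decoder an explicit reference for when each block of audio frames begins.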