MOSS-Audio Technical Report

Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu, Jingqi Chen, Ke Chen, Wenxuan Wang, Yang Wang, Yaozhou Jiang, Yi Jiang, Zhengyuan Lin, Ziqi Chen, Zhaoye Fei, Chenghao Liu, Donghua Yu, Jun Zhan, Kang Yu, Kexin Huang, Liwei Fan, Mingshu Chen, Qinyuan Cheng, Ruixiao Li, Shimin Li, Songlin Wang, Xingjian Zhao, Yang Gao, Yitian Gong, Yiyang Zhang, Zhe Xu, Xipeng Qiu

MOSS-Audio is a unified audio-language model for speech, environmental sound, and music understanding, supporting audio captioning, time-aware question answering, timestamped transcription, and audio-grounded reasoning. MOSS-Audio couples a dedicated audio encoder with a modality adapter and a large language model: the encoder produces temporal representations at 12.5 Hz, the adapter projects them into the decoder space, and the decoder generates text autoregressively. Two design choices are central to the system: DeepStack cross-layer feature injection, which exposes the decoder to acoustic information from multiple encoder depths, and time markers, which provide explicit temporal cues by inserting timestamp markers into the audio-token stream. At the data level, we design an event-preserving audio annotation pipeline that segments raw audio at coherent event boundaries, applies branch-specific annotation to speech, music, and general audio, and merges the results into unified captions for pretraining. The intermediate branch-specific captions are retained to support the construction of task-oriented SFT data. The model is pretrained on large-scale audio-language data, with time-aware objectives incorporated to support temporal grounding, and then undergoes multi-stage post-training to enhance instruction following and audio-grounded reasoning. We release 4B and 8B variants in both Instruct and Thinking configurations. MOSS-Audio achieves strong performance across general audio understanding, speech captioning, ASR, and timestamped ASR, positioning it as a promising audio-understanding foundation for future voice agents.
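To make the time-marker idea concrete, the following is a minimal sketch of interleaving timestamp markers into a 12.5 Hz audio-token stream. Only the 12.5 Hz frame rate comes from the abstract; the marker interval (`marker_interval_s`), the marker string format (`<|2.0s|>`), and the function names are assumptions chosen for illustration and may differ from what MOSS-Audio actually uses.

```python
from typing import List, Union

FRAME_RATE_HZ = 12.5  # encoder frame rate stated in the abstract

Token = Union[int, str]


def frame_count(duration_s: float, frame_rate_hz: float = FRAME_RATE_HZ) -> int:
    """Number of whole encoder frames produced for a clip of the given length."""
    return int(duration_s * frame_rate_hz)


def insert_time_markers(
    audio_tokens: List[int],
    marker_interval_s: float = 2.0,  # hypothetical spacing, not from the report
    frame_rate_hz: float = FRAME_RATE_HZ,
) -> List[Token]:
    """Interleave textual timestamp markers with audio tokens.

    A marker is placed before every frame whose index is a multiple of
    the number of frames spanned by one marker interval.
    """
    frames_per_marker = int(round(marker_interval_s * frame_rate_hz))
    if frames_per_marker < 1:
        raise ValueError("marker interval is shorter than one encoder frame")

    out: List[Token] = []
    for i, tok in enumerate(audio_tokens):
        if i % frames_per_marker == 0:
            # Hypothetical marker format: elapsed seconds at this frame.
            out.append(f"<|{i / frame_rate_hz:.1f}s|>")
        out.append(tok)
    return out


# Placeholder token ids standing in for 4 seconds of audio (50 frames).
stream = insert_time_markers(list(range(frame_count(4.0))))
```

With a 2-second interval, each marker precedes a run of 25 audio frames, so the decoder sees an explicit time reference in the sequence itself instead of having to infer elapsed time from token position.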

