Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve such capabilities is post-training, whose effectiveness depends critically on the quality and diversity of the training data. However, existing audio-language datasets often contain substantial redundancy: many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. From these unified annotations, we use Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. Using this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.
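The acoustic similarity-based deduplication step can be illustrated with a minimal sketch. The abstract does not state which audio embedding, similarity measure, or threshold the pipeline uses, so everything below is an assumption made for illustration: clips are represented by precomputed embedding vectors, similarity is cosine similarity, and the function names `cosine_similarity` and `deduplicate` as well as the threshold of 0.9 are hypothetical. It shows one simple greedy strategy, not necessarily the authors' method.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def deduplicate(embeddings: Dict[str, Sequence[float]],
                threshold: float = 0.9) -> List[str]:
    """Greedy pass over clips in insertion order.

    A clip is kept only if its similarity to every clip kept so far is
    strictly below `threshold` (an assumed value, not taken from the paper).
    """
    kept: List[str] = []
    for clip_id, vec in embeddings.items():
        if all(cosine_similarity(vec, embeddings[k]) < threshold for k in kept):
            kept.append(clip_id)
    return kept


# Toy 3-dimensional vectors standing in for real acoustic embeddings.
toy = {
    "dog_bark_1": [1.0, 0.0, 0.0],
    "dog_bark_2": [0.99, 0.05, 0.0],  # near-duplicate of dog_bark_1
    "piano_solo": [0.0, 1.0, 0.0],
    "speech_news": [0.0, 0.0, 1.0],
}
kept_ids = deduplicate(toy, threshold=0.9)
print(kept_ids)
```

In this toy input, `dog_bark_2` is dropped because it is almost parallel to `dog_bark_1`, while the other clips are kept. The greedy pass compares each clip against all kept clips, which is quadratic in the worst case; at the scale of a real corpus an approximate nearest-neighbour index would likely be needed instead.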