AudioDER: A Deduplication-Enhanced Reasoning Dataset for Post-Training Large Audio-Language Models

Large Audio-Language Models (LALMs) have shown strong performance on a wide range of audio understanding tasks, yet they still struggle with complex audio reasoning. A practical way to improve this capability is post-training, whose effectiveness depends critically on the quality and diversity of the training data. However, existing audio-language datasets often contain substantial redundancy: many samples are highly similar in acoustic content and thus provide overlapping supervisory signals. Such redundancy not only increases annotation cost, but also limits corpus diversity and reduces the effectiveness of post-training. To address this issue, we propose a redundancy-aware data construction pipeline for building reasoning-oriented supervision for LALMs. Specifically, we first perform acoustic similarity-based deduplication across raw audio datasets to improve corpus diversity. We then integrate existing audio captions and question-answer pairs into a unified multiple-choice format. From these unified annotations, we use Qwen3-30B to generate chain-of-thought (CoT) rationales for reasoning-oriented supervision. With this pipeline, we construct AudioDER, a reasoning-oriented post-training dataset containing approximately 191k samples spanning sound, speech, and music. Each sample consists of an audio clip, a multiple-choice question, four answer candidates, an audio caption, and a CoT rationale. Extensive experiments show that post-training on AudioDER consistently improves the performance of Qwen2-Audio-7B-Instruct on multiple audio reasoning benchmarks, including MMAU-mini, MMSU, and MMAR. We hope AudioDER can serve as a valuable resource for advancing audio reasoning research and the development of more capable LALMs.
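
To make the deduplication step concrete, here is a minimal sketch of one common way to filter near-duplicate clips: greedily keep a clip only if its embedding is not too similar to any clip already kept. The abstract does not state which audio encoder, similarity measure, or threshold the authors use, so the cosine measure, the `threshold` value, and the function names below are illustrative assumptions rather than the paper's method.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def deduplicate(embeddings, threshold=0.9):
    """Greedy near-duplicate filter (hypothetical helper, not from the paper).

    embeddings: one vector per audio clip, assumed to come from some audio encoder.
    threshold:  assumed similarity cutoff; clips at or above it count as duplicates.
    Returns the indices of the clips that are kept, in input order.
    """
    kept = []
    for i, emb in enumerate(embeddings):
        # Keep clip i only if it is below the cutoff against every kept clip.
        if all(cosine(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


# Toy usage: the second vector nearly matches the first, so only it is dropped.
toy = [[1.0, 0.0], [0.99, 0.1], [0.0, 1.0]]
print(deduplicate(toy))
```

This pairwise scan is quadratic in the number of clips; at the scale of a corpus with hundreds of thousands of samples, an approximate nearest-neighbor index would be the more practical choice.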

