Audio editing aims to manipulate audio content based on textual descriptions, supporting tasks such as adding, removing, or replacing audio events. Despite recent progress, the lack of high-quality benchmark datasets and comprehensive evaluation metrics remains a major obstacle both to assessing audio editing quality and to improving the task itself. In this work, we propose a novel approach to the audio editing task that incorporates expert knowledge into both evaluation and dataset construction: 1) We establish AuditScore, the first comprehensive dataset for subjective evaluation of audio editing, consisting of over 6,300 edited samples generated by 7 representative audio editing frameworks under 23 system configurations. Each sample is annotated by professional raters on three key aspects of audio editing quality: overall Quality, Relevance to the editing intent, and Faithfulness to the original features. 2) Based on this dataset, we propose AuditEval, a family of automatic MOS-style evaluators tailored for audio editing, covering both SSL-based and LLM-based approaches. AuditEval addresses the lack of effective objective metrics and the prohibitive cost of subjective evaluation in this field. 3) We further use AuditEval to evaluate and filter a large number of synthetically mixed editing pairs, mining a high-quality pseudo-parallel subset by selecting the most plausible samples. Comprehensive experiments show that our expert-informed filtering strategy yields higher-quality data, while also exposing the limitations of traditional objective metrics and the advantages of AuditEval. The dataset, code, and tools are available at: https://github.com/NKU-HLT/AuditEval.
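The filtering step in 3) can be illustrated with a minimal sketch. It assumes each synthetic editing pair already carries three predicted scores (Quality, Relevance, Faithfulness) on a MOS-like 1 to 5 scale. The `ScoredPair` type, the `filter_pairs` function, the per-dimension floor `min_score`, the mean-based ranking, and the `keep_ratio` parameter are all hypothetical choices made for illustration; the abstract states only that the most plausible samples are selected, so this is not the authors' procedure or the AuditEval API.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass
class ScoredPair:
    """A synthetic editing pair with predicted scores (hypothetical structure)."""
    pair_id: str
    quality: float       # predicted overall Quality
    relevance: float     # predicted Relevance to the editing intent
    faithfulness: float  # predicted Faithfulness to the original features


def mean_score(pair: ScoredPair) -> float:
    """Average of the three predicted scores (an assumed aggregation rule)."""
    return (pair.quality + pair.relevance + pair.faithfulness) / 3.0


def filter_pairs(
    pairs: List[ScoredPair],
    keep_ratio: float = 0.5,
    min_score: float = 3.0,
) -> List[ScoredPair]:
    """Return the best-scoring pairs.

    A pair is eligible only if every dimension reaches `min_score`.
    Eligible pairs are ranked by mean score, and at most
    ceil(len(pairs) * keep_ratio) of them are kept.
    """
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    eligible = [
        p for p in pairs
        if min(p.quality, p.relevance, p.faithfulness) >= min_score
    ]
    eligible.sort(key=mean_score, reverse=True)
    n_keep = math.ceil(len(pairs) * keep_ratio)
    return eligible[:n_keep]


# Toy scores, invented purely to exercise the function.
candidates = [
    ScoredPair("a", 4.5, 4.0, 4.2),
    ScoredPair("b", 2.0, 4.8, 4.9),  # fails the per-dimension floor
    ScoredPair("c", 3.5, 3.2, 3.1),
    ScoredPair("d", 4.0, 4.1, 3.9),
]
selected = filter_pairs(candidates, keep_ratio=0.5, min_score=3.0)
print([p.pair_id for p in selected])
```

Applying a floor to each dimension before ranking prevents a pair that is strong on two aspects but poor on the third from entering the subset through a high average; whether such a floor is appropriate depends on the data and is left open here.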