Towards Automatic Evaluation and High-Quality Pseudo-Parallel Dataset Construction for Audio Editing: A Human-in-the-Loop Method

Audio editing aims to manipulate audio content based on textual descriptions, supporting tasks such as adding, removing, or replacing audio events. Despite recent progress, the lack of high-quality benchmark datasets and comprehensive evaluation metrics remains a major obstacle both to assessing audio editing quality and to improving the task itself. In this work, we propose a novel approach to the audio editing task that incorporates expert knowledge into both the evaluation and the dataset construction processes: 1) First, we establish AuditScore, the first comprehensive dataset for subjective evaluation of audio editing, consisting of over 6,300 edited samples generated from 7 representative audio editing frameworks and 23 system configurations. Each sample is annotated by professional raters on three key aspects of audio editing quality: overall Quality, Relevance to editing intent, and Faithfulness to original features. 2) Based on this dataset, we propose AuditEval, a family of automatic MOS-style evaluators tailored for audio editing, covering both SSL-based and LLM-based approaches. It addresses the lack of effective objective metrics and the prohibitive cost of subjective evaluation in this field. 3) We further leverage AuditEval to evaluate and filter a large number of synthetically mixed editing pairs, mining a high-quality pseudo-parallel subset by selecting the most plausible samples. Comprehensive experiments validate that our expert-informed filtering strategy yields higher-quality data, while also exposing the limitations of traditional objective metrics and the advantages of AuditEval. The dataset, code, and tools are available at: https://github.com/NKU-HLT/AuditEval.
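The filtering step in 3) can be illustrated with a minimal sketch. It is not the authors' implementation: the `Scores` container, the `filter_top_pairs` function, the `keep_ratio` parameter, the 1 to 5 score range, and the rule of ranking by the weakest of the three dimensions are all assumptions made for illustration. The abstract only states that AuditEval scores synthetic pairs and that the most plausible ones are kept. The `scorer` argument stands in for any evaluator that predicts Quality, Relevance, and Faithfulness for an editing pair.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, TypeVar

Pair = TypeVar("Pair")  # any representation of a (source, instruction, edited) triple


@dataclass(frozen=True)
class Scores:
    """Hypothetical container for predicted MOS-style scores (assumed 1-5 range)."""
    quality: float
    relevance: float
    faithfulness: float

    def weakest(self) -> float:
        # Assumption: a pair is only as plausible as its lowest-scoring dimension.
        return min(self.quality, self.relevance, self.faithfulness)


def filter_top_pairs(
    pairs: Sequence[Pair],
    scorer: Callable[[Pair], Scores],
    keep_ratio: float = 0.2,
) -> List[Pair]:
    """Keep the top `keep_ratio` fraction of pairs, ranked by their weakest score."""
    if not 0.0 < keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in (0, 1]")
    if not pairs:
        return []
    # Score every candidate once, then rank from most to least plausible.
    scored = [(scorer(pair), pair) for pair in pairs]
    scored.sort(key=lambda item: item[0].weakest(), reverse=True)
    n_keep = max(1, int(len(scored) * keep_ratio))
    return [pair for _, pair in scored[:n_keep]]
```

Ranking by the weakest dimension is one possible design choice: it prevents a pair with high Relevance but poor Faithfulness from being selected on the strength of an average. A weighted sum or per-dimension thresholds would be equally reasonable alternatives.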

