SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

Instruction-guided speech editing requires a model to modify specified speech attributes while preserving unrelated characteristics. Despite rapid progress in Speech Large Language Models (Speech LLMs), systematic evaluation of this capability remains challenging because existing benchmarks are fragmented across isolated editing tasks. To bridge this gap, we introduce SpeechEditBench, a bilingual multi-attribute benchmark for instruction-guided speech editing. SpeechEditBench encompasses seven atomic editing tasks, as well as compositional editing tasks that integrate multiple operations within a single instruction. We propose an anchor-based evaluation protocol that separately assesses the edit success of target attributes and the preservation of untargeted attributes, yielding three metrics: target success, preservation success, and joint success. Using this benchmark, we evaluate mainstream Speech LLMs and specialized speech editing systems. The results reveal three key findings: (1) no single model performs well across all editing dimensions; (2) closed-source Speech LLMs generally outperform open-source models; (3) compositional editing remains highly challenging, with even the most advanced models struggling to achieve high joint success. SpeechEditBench provides a rigorous diagnostic framework for identifying bottlenecks in Speech LLMs, thereby facilitating the development of next-generation Speech LLMs with more robust and precise instruction-guided editing capabilities. Data and code are available at https://github.com/daxintan-cuhk/SpeechEditBench.
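
The relationship between the three metrics can be illustrated with a minimal sketch. The abstract does not spell out how per-sample judgments are derived from the anchors or how joint success is aggregated, so the following is an assumption rather than the benchmark's actual implementation: each edited sample is assumed to carry two boolean judgments, and joint success is assumed to require both on the same sample. The names `EditResult` and `summarize` are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class EditResult:
    """Per-sample judgments (hypothetical structure, not from the benchmark)."""
    target_ok: bool     # the targeted attribute was edited as instructed
    preserved_ok: bool  # the untargeted attributes were left unchanged


def summarize(results):
    """Aggregate per-sample judgments into three rates.

    Assumption: joint success counts a sample only when both the target
    edit and the preservation check succeed on that same sample.
    """
    n = len(results)
    if n == 0:
        raise ValueError("results must not be empty")
    target = sum(r.target_ok for r in results) / n
    preservation = sum(r.preserved_ok for r in results) / n
    joint = sum(r.target_ok and r.preserved_ok for r in results) / n
    return {
        "target_success": target,
        "preservation_success": preservation,
        "joint_success": joint,
    }


if __name__ == "__main__":
    demo = [
        EditResult(True, True),
        EditResult(True, False),
        EditResult(False, True),
        EditResult(False, False),
    ]
    print(summarize(demo))
```

Under this assumed definition, joint success can never exceed either of the other two rates, which is consistent with the finding that compositional edits, where several attributes must change while the rest stay fixed, are the hardest setting.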

