Speech editing achieves semantic inversion by performing fine-grained, segment-level manipulation of an original utterance while preserving global perceptual naturalness. Existing detection studies focus mainly on manually edited speech with explicit splicing artifacts and therefore struggle to cope with emerging end-to-end neural speech editing techniques that generate seamless acoustic transitions. To address this challenge, we first construct a large-scale bilingual dataset, AiEdit, which leverages large language models to drive precise semantic-tampering logic and employs multiple advanced neural speech editing methods for data synthesis, thereby filling the gap in high-quality speech editing datasets. Building on this foundation, we propose PELM (Prior-Enhanced Audio Large Language Model), the first large-model framework that unifies speech editing detection and content localization by formulating both as an audio question answering task. To mitigate the forgery bias and semantic-priority bias inherent in existing audio large language models, PELM incorporates word-level probability priors that provide explicit acoustic cues, and further introduces a centroid-aggregation-based acoustic consistency perception loss that explicitly enforces the modeling of subtle local distribution anomalies. Extensive experiments demonstrate that PELM significantly outperforms state-of-the-art methods, achieving equal error rates (EER) of 0.57\% and 9.28\% (localization) on the HumanEdit and AiEdit datasets, respectively.
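As a reading aid, one plausible instantiation of such a centroid-aggregation consistency objective is sketched below; the notation $\mathcal{G}$ (unedited frames), $\mathcal{E}$ (edited frames), $\mathbf{h}_t$ (frame-level acoustic embeddings), and the cosine-margin form are illustrative assumptions, not the loss actually defined in this work:
\[
\mathbf{c} \;=\; \frac{1}{|\mathcal{G}|}\sum_{t \in \mathcal{G}} \mathbf{h}_t,
\qquad
\mathcal{L}_{\mathrm{acp}} \;=\;
\frac{1}{|\mathcal{G}|}\sum_{t \in \mathcal{G}} \bigl(1 - \cos(\mathbf{h}_t, \mathbf{c})\bigr)
\;+\;
\frac{1}{|\mathcal{E}|}\sum_{t \in \mathcal{E}} \max\bigl(0,\; \cos(\mathbf{h}_t, \mathbf{c}) - m\bigr),
\]
i.e., unedited frames are aggregated toward a shared centroid $\mathbf{c}$ while edited frames are pushed at least a margin $m$ away from it, so that subtle local distribution anomalies become separable from genuine speech.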