Existing speech editing detection (SED) datasets are predominantly constructed using manual splicing or a limited set of editing operations, resulting in restricted diversity and poor coverage of realistic editing scenarios. Meanwhile, current SED methods rely heavily on frame-level supervision to detect observable acoustic anomalies, which fundamentally limits their ability to handle deletion-type edits, where the manipulated content is entirely absent from the signal. To address these challenges, we present a unified framework that bridges speech editing detection and content localization through a generative formulation based on Audio Large Language Models (Audio LLMs). We first introduce AiEdit (https://huggingface.co/datasets/JunXueTech/AiEdit), a large-scale bilingual dataset (approximately 140 hours) that covers addition, deletion, and modification operations produced with state-of-the-art end-to-end speech editing systems, providing a more realistic benchmark for modern threats. Building upon this, we reformulate SED as a structured text generation task, enabling joint reasoning over edit type identification and content localization. To better ground the generative model in acoustic evidence, we propose a prior-enhanced prompting strategy that injects word-level probabilistic cues derived from a frame-level detector. Furthermore, we introduce an acoustic consistency-aware loss that explicitly enforces separation between normal and anomalous acoustic representations in the latent space. Experimental results demonstrate that the proposed approach consistently outperforms existing methods on both detection and localization tasks.
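The prior-enhanced prompting idea, word-level probabilistic cues derived from a frame-level detector, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how frame scores are pooled or how the prompt is worded, so the mean pooling, the `Word` type, the functions `word_level_priors` and `build_prompt`, the `frame_shift` parameter, and the prompt text are all hypothetical choices made for illustration. Word boundaries are assumed to come from an external forced aligner.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Word:
    """A word with its time span in seconds (assumed to come from a forced aligner)."""
    text: str
    start: float
    end: float


def word_level_priors(
    frame_probs: Sequence[float],
    words: Sequence[Word],
    frame_shift: float = 0.02,
) -> List[float]:
    """Mean-pool frame-level edit probabilities over each word's time span.

    Mean pooling is an assumption; max pooling or another aggregate would
    fit the same interface.
    """
    priors = []
    for w in words:
        first = int(round(w.start / frame_shift))
        # Guarantee at least one frame per word, even for very short words.
        last = max(first + 1, int(round(w.end / frame_shift)))
        segment = frame_probs[first:last]
        priors.append(sum(segment) / len(segment) if segment else 0.0)
    return priors


def build_prompt(words: Sequence[Word], priors: Sequence[float]) -> str:
    """Format the word-level cues as text for an Audio LLM prompt (hypothetical wording)."""
    cues = " ".join(f"{w.text}({p:.2f})" for w, p in zip(words, priors))
    return (
        "Word-level edit probabilities from a frame-level detector: "
        + cues
        + "\nIdentify the edit type and the edited content."
    )
```

With two one-second words and a detector that flags only the second half of the signal, `word_level_priors` assigns a low prior to the first word and a high prior to the second, and `build_prompt` exposes both values to the language model as plain text.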