Freely available and easy-to-use audio editing tools make it straightforward to perform audio splicing. Convincing forgeries can be created by combining various speech samples from the same person. Detecting such splices is important both in the public sector, when considering misinformation, and in a legal context, to verify the integrity of evidence. Unfortunately, most existing detection algorithms for audio splicing use handcrafted features and make specific assumptions. Criminal investigators, however, are often faced with audio samples from unconstrained sources with unknown characteristics, which raises the need for more generally applicable methods. With this work, we aim to take a first step towards unconstrained audio splicing detection to address this need. We simulate various attack scenarios in the form of post-processing operations that may disguise splicing. We propose a Transformer sequence-to-sequence (seq2seq) network for splicing detection and localization. Our extensive evaluation shows that the proposed method outperforms existing dedicated approaches for splicing detection [3, 10] as well as the general-purpose networks EfficientNet [28] and RegNet [25].