Freely available and easy-to-use audio editing tools make audio splicing straightforward. Convincing forgeries can be created by combining various speech samples from the same person. Detecting such splices is important both in the public sector, where misinformation is a concern, and in legal contexts, where the integrity of evidence must be verified. Unfortunately, most existing detection algorithms for audio splicing rely on handcrafted features and make specific assumptions. Criminal investigators, however, often face audio samples from unconstrained sources with unknown characteristics, which calls for more generally applicable methods. With this work, we take a first step towards unconstrained audio splicing detection to address this need. We simulate various attack scenarios in the form of post-processing operations that may disguise splicing. We propose a Transformer sequence-to-sequence (seq2seq) network for splicing detection and localization. Our extensive evaluation shows that the proposed method outperforms existing dedicated approaches for splicing detection [3, 10] as well as the general-purpose networks EfficientNet [28] and RegNet [25].