Good datasets are essential for developing and benchmarking any machine learning system. Their importance is even greater for safety-critical applications such as deepfake detection, the focus of this paper. Here we reveal that two of the most widely used audio-video deepfake datasets suffer from a previously unidentified spurious feature: leading silence. Fake videos start with a very brief moment of silence, and based on this feature alone we can separate the real and fake samples almost perfectly. As such, previous audio-only and audio-video models exploit the presence of silence in the fake videos and consequently perform worse when the leading silence is removed. To avoid latching onto this unwanted artifact, and possibly other undiscovered ones, we propose a shift from supervised to unsupervised learning by training models exclusively on real data. We show that by aligning self-supervised audio-video representations we remove the risk of relying on dataset-specific biases and improve robustness in deepfake detection.
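To illustrate how trivially such a spurious feature can be exploited, the following is a minimal sketch (not the paper's code) of a leading-silence heuristic: measure how long a clip stays near-silent at its start and threshold on that duration. The amplitude cutoff and silence threshold are illustrative assumptions, not values from the paper.

```python
import numpy as np

def leading_silence_sec(audio, sr, amp_thresh=1e-3):
    """Return the duration (seconds) of the initial near-silent span.

    audio: 1-D waveform; amp_thresh: amplitude below which a sample
    counts as silence (hypothetical cutoff, for illustration only).
    """
    voiced = np.flatnonzero(np.abs(audio) >= amp_thresh)
    if voiced.size == 0:
        return len(audio) / sr  # entirely silent clip
    return voiced[0] / sr

def classify_by_silence(audio, sr, min_silence=0.05):
    """Label a clip 'fake' if its leading silence exceeds min_silence seconds."""
    return "fake" if leading_silence_sec(audio, sr) > min_silence else "real"

# Synthetic demo: a "fake" clip padded with 0.1 s of silence vs. a "real" one.
sr = 16000
speech = 0.1 * np.sin(2 * np.pi * 220 * np.arange(sr) / sr)  # 1 s of tone
fake = np.concatenate([np.zeros(int(0.1 * sr)), speech])
real = speech
```

That a one-line amplitude check can act as a near-perfect classifier is precisely why such dataset artifacts are dangerous: a supervised detector trained on these datasets can reach high accuracy without learning anything about deepfakes.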