Good datasets are essential for developing and benchmarking any machine learning system. Their importance is even greater for safety-critical applications such as deepfake detection, the focus of this paper. Here we reveal that two of the most widely used audio-video deepfake datasets suffer from a previously unidentified spurious feature: leading silence. Fake videos start with a very brief moment of silence, and based on this feature alone we can separate real and fake samples almost perfectly. As a result, previous audio-only and audio-video models exploit the presence of silence in the fake videos and consequently perform worse when the leading silence is removed. To avoid latching onto this unwanted artifact, and possibly other unrevealed ones, we propose a shift from supervised to unsupervised learning by training models exclusively on real data. We show that by aligning self-supervised audio-video representations we remove the risk of relying on dataset-specific biases and improve robustness in deepfake detection.
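To make the spurious feature concrete, the silence shortcut can be sketched as a trivial energy-based detector: measure how long the clip stays below an energy threshold at its start, and label the sample by that duration alone. This is a minimal illustration, not the paper's method; the function names, frame sizes, and thresholds below are illustrative assumptions.

```python
import numpy as np

def leading_silence_sec(wav, sr, frame_len=400, hop=160, energy_thresh=1e-4):
    """Return the duration (seconds) of leading silence: time until the first
    frame whose mean energy exceeds energy_thresh (thresholds illustrative)."""
    n = len(wav)
    for start in range(0, n - frame_len + 1, hop):
        frame = wav[start:start + frame_len]
        if np.mean(frame ** 2) > energy_thresh:
            return start / sr
    return n / sr  # clip is entirely silent

def silence_only_classifier(wav, sr, min_silence=0.01):
    # Hypothetical shortcut classifier: call a clip "fake" if it opens with
    # a brief stretch of leading silence, "real" otherwise.
    return "fake" if leading_silence_sec(wav, sr) >= min_silence else "real"
```

A detector this crude separating real and fake near-perfectly is exactly the kind of dataset bias the abstract warns against: a model trained on such data can score well without learning anything about manipulation artifacts.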