Audio deepfakes have improved rapidly in recent years, yet their effect on human trust in real speech remains unstudied. We present the largest listening study on audio deepfake perception to date, collecting 35,532 judgments from 1,768 participants across 138 text-to-speech and voice conversion systems. Our central finding is a skepticism shift: compared to a 2021 baseline, human accuracy on fake samples barely changed (72.9% to 71.2%), but accuracy on real samples dropped from 72.7% to 64.1%. Participants have not become worse at detecting synthesis artifacts; rather, they increasingly distrust authentic speech. Samples generated by commercial and autoregressive language model systems proved hardest to detect (61.3-65.9% accuracy), while those from traditional seq2seq and flow-matching models remained easier to spot (75.4-76.8%). A machine-learning detector, included as a reference point, maintained over 94.5% accuracy across all conditions. Our results suggest that the primary threat posed by modern deepfakes may not be mere deception, but the erosion of trust in genuine audio.
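As a rough illustration of the skepticism shift, the sketch below computes the change in accuracy per class, in percentage points, using only the four figures quoted above. The variable and function names are hypothetical and chosen for illustration; this is not the study's analysis code.

```python
# Accuracies (in percent) quoted in the abstract; names are illustrative.
baseline_2021 = {"fake": 72.9, "real": 72.7}
this_study = {"fake": 71.2, "real": 64.1}


def shift_in_points(before, after):
    """Return the per-class change in accuracy, in percentage points."""
    return {label: round(after[label] - before[label], 1) for label in before}


shift = shift_in_points(baseline_2021, this_study)
print(shift)  # {'fake': -1.7, 'real': -8.6}
```

The drop on real samples (8.6 points) is about five times the drop on fake samples (1.7 points), which is the asymmetry the abstract describes.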