Deepfakes are increasingly realistic and easy to produce, raising concerns about the reliability of human judgments in misinformation settings. We study audiovisual deepfake detection by measuring how consistently crowd workers distinguish authentic from manipulated videos and, when they flag a video as manipulated, how accurately they identify the manipulation type (audio-only, video-only, or audio-video) and how consistently they report manipulation timestamps. We run two matched crowdsourcing studies on Prolific using AV-Deepfake1M and the Trusted Media Challenge (TMC) dataset. We sample 48 videos per dataset (96 total) and collect 960 judgments (10 per video). Results show that crowd workers rarely misclassify authentic videos as manipulated, but they miss many manipulations, and agreement remains limited across videos. Aggregating multiple judgments per video stabilizes the authenticity signal, but it cannot recover manipulations that most workers consistently miss. Manipulation type identification is substantially noisier than authenticity detection even when workers detect a manipulation, with joint audio-video cases being particularly hard to recognize. Overall, these findings suggest that crowdsourcing can provide a scalable screening signal for audiovisual authenticity, while reliable modality attribution remains an open challenge.
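The aggregation step mentioned in the abstract can be illustrated with a minimal sketch. It assumes simple majority voting over binary labels, which the abstract does not specify; the label strings, the tie-breaking rule, the function names, and the example judgments are all hypothetical and are not taken from the study's data.

```python
from collections import Counter

# Hypothetical label strings; the study's actual answer options may differ.
AUTHENTIC = "authentic"
MANIPULATED = "manipulated"


def majority_vote(judgments, tie_label=AUTHENTIC):
    """Aggregate one video's judgments by simple majority.

    Ties fall back to `tie_label` (an assumption made for this sketch).
    """
    counts = Counter(judgments)
    if counts[MANIPULATED] > counts[AUTHENTIC]:
        return MANIPULATED
    if counts[AUTHENTIC] > counts[MANIPULATED]:
        return AUTHENTIC
    return tie_label


def aggregate(judgments_by_video):
    """Map each video id to its majority-vote label."""
    return {vid: majority_vote(labels) for vid, labels in judgments_by_video.items()}


# Made-up judgments, 10 per video as in the study design.
example = {
    # Authentic video with one false alarm: the vote absorbs the noise.
    "real_clip": [AUTHENTIC] * 9 + [MANIPULATED],
    # Manipulated video that most workers notice.
    "obvious_fake": [MANIPULATED] * 8 + [AUTHENTIC] * 2,
    # Manipulated video that most workers miss: the vote cannot recover it.
    "subtle_fake": [AUTHENTIC] * 7 + [MANIPULATED] * 3,
}

aggregated = aggregate(example)
```

In this toy input, `subtle_fake` is labeled authentic after aggregation even though three workers flagged it, which mirrors the limitation the abstract describes: voting reduces individual noise but cannot correct an error shared by most workers.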