Listening Deepfake Detection: A New Perspective Beyond Speaking-Centric Forgery Analysis

Existing deepfake detection research has primarily focused on scenarios where the manipulated subject is actively speaking, i.e., where fabricated content is generated by altering the speaker's appearance or voice. However, in realistic interaction settings, attackers often alternate between falsifying speaking and listening states to mislead their targets, which makes the scenario more realistic and persuasive. The detection of 'listening deepfakes' remains largely unexplored and is hindered by a scarcity of both datasets and methods. At the same time, the relatively limited quality of synthesized listening reactions offers a promising opening for current deepfake detection efforts. In this paper, we introduce the task of Listening Deepfake Detection (LDD). We present ListenForge, the first dataset specifically designed for this task, constructed using five Listening Head Generation (LHG) methods. To address the distinctive characteristics of listening forgeries, we propose MANet, a Motion-aware and Audio-guided Network that captures subtle motion inconsistencies in listener videos while leveraging the speaker's audio semantics to guide cross-modal fusion. Extensive experiments demonstrate that existing Speaking Deepfake Detection (SDD) models perform poorly in listening scenarios, whereas MANet achieves substantially better performance on ListenForge. Our work highlights the need to rethink deepfake detection beyond the traditional speaking-centric paradigm and opens new directions for multimodal forgery analysis in interactive communication settings. The dataset and code are available at https://anonymous.4open.science/r/LDD-B4CB.
