Existing lip-sync deepfake detectors rely on pixel artifacts or audio-visual correspondence, and both fail under generator or language shift because the features they learn are tied to the training distribution. We take a different approach. Authentic lip motion is constrained by tissue mechanics and neuromuscular bandwidth; current generators typically do not impose these constraints, producing trajectories with elevated variance in velocity, acceleration, and jerk that real speech does not exhibit. We exploit this signal, which we term temporal lip jitter, by computing kinematic statistics from 64 perioral landmarks over short sliding windows and feeding them into a lightweight three-branch network. The model uses only landmark coordinates: no pixels, no audio, and no voiceprint data. We train only on English data and test in a zero-shot setting on five unseen generators and seven languages.
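To make the kinematic features concrete, the following is a minimal sketch of how per-window velocity, acceleration, and jerk statistics could be computed from landmark trajectories. It is not the authors' implementation. The function name `kinematic_window_stats`, the frame rate, the window length, the stride, and the choice of mean and variance as the summary statistics are all assumptions made for illustration; the abstract specifies only that statistics are computed from 64 perioral landmarks over short sliding windows.

```python
import numpy as np


def kinematic_window_stats(landmarks, fps=25.0, window=16, stride=8):
    """Summarise landmark kinematics over sliding windows.

    landmarks: array of shape (T, K, 2) holding K 2-D landmarks per frame
               (K = 64 in the abstract).
    fps, window, stride: hypothetical values, not taken from the source.

    Returns an array of shape (n_windows, 6) with columns
    [mean |v|, var |v|, mean |a|, var |a|, mean |j|, var |j|].
    """
    landmarks = np.asarray(landmarks, dtype=np.float64)
    if landmarks.ndim != 3 or landmarks.shape[2] != 2:
        raise ValueError("expected landmarks of shape (T, K, 2)")

    dt = 1.0 / fps
    # Finite-difference derivatives along the time axis.
    vel = np.diff(landmarks, n=1, axis=0) / dt        # (T-1, K, 2)
    acc = np.diff(landmarks, n=2, axis=0) / dt ** 2   # (T-2, K, 2)
    jerk = np.diff(landmarks, n=3, axis=0) / dt ** 3  # (T-3, K, 2)

    # Per-landmark magnitudes, truncated to a common length.
    n = jerk.shape[0]
    if n < window:
        return np.empty((0, 6))
    mags = [np.linalg.norm(d, axis=-1)[:n] for d in (vel, acc, jerk)]

    rows = []
    for start in range(0, n - window + 1, stride):
        row = []
        for m in mags:
            chunk = m[start:start + window]  # (window, K)
            row.extend([chunk.mean(), chunk.var()])
        rows.append(row)
    return np.asarray(rows)
```

Under these assumptions, a trajectory with added frame-to-frame noise yields a much larger jerk-variance column than a smooth one, which is the kind of contrast the temporal lip jitter signal relies on. How the resulting features are split across the three network branches is not stated in the abstract and is not modelled here.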