Existing lip-sync deepfake detectors rely on pixel artifacts or audio-visual correspondence, and both fail under generator or language shift because the features they learn are tied to the training distribution. We take a different approach. Real lip motion is constrained by tissue mechanics and neuromuscular bandwidth; current generators impose neither constraint, producing trajectories with elevated variance in velocity, acceleration, and jerk that real speech does not exhibit. We exploit this as a detection signal, temporal lip jitter, by computing displacement, velocity, acceleration, and jerk statistics from 64 perioral landmarks over 25-frame windows and feeding them into a lightweight three-branch network. The model uses only landmark coordinates: no pixels, no audio, and no voiceprint data.
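As a rough illustration of the kinematic statistics described above, the following sketch computes per-window displacement, velocity, acceleration, and jerk summaries from a landmark sequence. It is not the authors' implementation: the function name `kinematic_stats`, the use of non-overlapping windows, the definition of displacement as offset from the window's first frame, the finite-difference derivatives, and the choice of mean and variance as the statistics are all assumptions made for the example. The three-branch network is not shown.

```python
import numpy as np

def kinematic_stats(landmarks, window=25):
    """Summarize landmark motion per window (hypothetical helper).

    landmarks: array of shape (T, K, 2), e.g. K = 64 perioral points.
    Returns an array of shape (num_windows, 8): mean and variance of the
    magnitude of displacement, velocity, acceleration, and jerk.
    """
    landmarks = np.asarray(landmarks, dtype=float)
    rows = []
    # Assumption: non-overlapping windows; trailing frames are dropped.
    for start in range(0, landmarks.shape[0] - window + 1, window):
        w = landmarks[start:start + window]      # (window, K, 2)
        disp = w - w[0]                          # offset from first frame
        vel = np.diff(w, n=1, axis=0)            # first finite difference
        acc = np.diff(w, n=2, axis=0)            # second finite difference
        jerk = np.diff(w, n=3, axis=0)           # third finite difference
        row = []
        for quantity in (disp, vel, acc, jerk):
            mag = np.linalg.norm(quantity, axis=-1)  # per-landmark magnitude
            row.extend([mag.mean(), mag.var()])
        rows.append(row)
    return np.array(rows, dtype=float).reshape(len(rows), 8)

# Synthetic demo: a smooth 1 Hz oscillation at 25 fps versus the same
# trajectory with per-frame Gaussian noise added (a stand-in for jitter).
rng = np.random.default_rng(0)
t = np.arange(100) / 25.0
base = rng.uniform(0.0, 100.0, size=(64, 2))
smooth_seq = np.repeat(base[None, :, :], 100, axis=0)
smooth_seq[:, :, 1] += 2.0 * np.sin(2.0 * np.pi * t)[:, None]
jittered_seq = smooth_seq + rng.normal(0.0, 0.5, size=smooth_seq.shape)

smooth_feats = kinematic_stats(smooth_seq)
jittered_feats = kinematic_stats(jittered_seq)
```

In this synthetic setup the jerk-variance column (index 7) is larger for the noisy sequence than for the smooth one, which is the kind of separation the abstract attributes to generated lip motion. Real landmark tracks also carry tracker noise, so any threshold would have to be learned from data rather than read off an example like this.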