Voice interfaces are increasingly used in high-stakes domains such as mobile banking, smart home security, and hands-free healthcare. Meanwhile, modern generative models have made high-quality voice forgeries inexpensive and easy to create, eroding confidence in voice authentication alone. To strengthen protection against such attacks, we present a second authentication factor that combines acoustic evidence with the unique motion patterns of a speaker's lower face. By placing lightweight inertial sensors around the mouth to capture mouth opening and evolving lower facial geometry, our system records a distinct motion signature with strong discriminative power across individuals. We built a prototype and recruited 43 participants to evaluate the system under four conditions: seated, walking on level ground, walking on stairs, and speaking with different language backgrounds (native vs. non-native English). Across all scenarios, our approach consistently achieved a median equal error rate (EER) of 0.01 or lower, indicating that mouth movement data remain robust under variations in gait, posture, and spoken language. We discuss specific use cases where this second line of defense could provide tangible security benefits to voice authentication systems.