Eve Said Yes: AirBone Authentication for Head-Wearable Smart Voice Assistant

Recent advances in machine learning and natural language processing have driven the rapid growth of smart voice assistants and their services, e.g., Alexa, Google Home, and Siri. However, voice spoofing attacks are considered one of the major challenges to voice control security, and they keep evolving, for example through deep-learning-based voice conversion and speech synthesis techniques. To solve this problem outside the acoustic domain, we focus on head-wearable devices, such as earbuds and virtual reality (VR) headsets, which can continuously monitor bone-conducted voice in the vibration domain. Specifically, we identify that air and bone conduction (AC/BC) from the same vocalization are coupled (or concurrent) and user-level unique, which makes them suitable behavioral and biometric factors for multi-factor authentication (MFA). With the proposed two-stage AirBone authentication, the legitimate user can defeat acoustic-domain and even cross-domain spoofing samples. The first stage answers \textit{whether air and bone conduction utterances are time domain consistent (TC)} and the second stage runs \textit{bone conduction speaker recognition (BC-SR)}. The security level is hence increased for two reasons: (1) current acoustic attacks on smart voice assistants cannot affect bone conduction, which is in the vibration domain; (2) even under advanced cross-domain attacks, the unique bone conduction features can detect an adversary's impersonation and machine-induced vibration. Finally, AirBone authentication has good usability (the same level as voice authentication) compared with traditional MFA and with methods specially designed to enhance smart voice security. Our experimental results show that the proposed AirBone authentication is usable and secure, and can be easily deployed on commercial off-the-shelf head wearables with good user experience.
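The two-stage decision flow described above can be illustrated with a minimal sketch. The abstract does not say how time-domain consistency is measured or how BC-SR is implemented, so everything below is an assumption made for illustration: the frame-energy envelope, the Pearson correlation test, the cosine-similarity speaker check, the function names, and both thresholds are hypothetical and are not the authors' method. The speaker embeddings are assumed to come from some external bone-conduction speaker model that is not shown.

```python
import math
from typing import List, Sequence


def energy_envelope(signal: Sequence[float], frame: int = 4) -> List[float]:
    """Mean energy per non-overlapping frame (a trailing partial frame is dropped)."""
    return [
        sum(x * x for x in signal[i:i + frame]) / frame
        for i in range(0, len(signal) - frame + 1, frame)
    ]


def pearson(a: Sequence[float], b: Sequence[float]) -> float:
    """Pearson correlation; returns 0.0 when either input has no variance."""
    n = min(len(a), len(b))
    if n < 2:
        return 0.0
    a, b = a[:n], b[:n]
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / math.sqrt(var_a * var_b)


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def airbone_authenticate(
    ac: Sequence[float],
    bc: Sequence[float],
    bc_embedding: Sequence[float],
    enrolled_embedding: Sequence[float],
    tc_threshold: float = 0.8,   # hypothetical value
    sr_threshold: float = 0.7,   # hypothetical value
) -> bool:
    # Stage 1 (TC): the AC and BC recordings must rise and fall together.
    # A loudspeaker replay produces AC energy but no matching BC vibration.
    if pearson(energy_envelope(ac), energy_envelope(bc)) < tc_threshold:
        return False
    # Stage 2 (BC-SR): the BC utterance must match the enrolled user.
    return cosine(bc_embedding, enrolled_embedding) >= sr_threshold


# Toy signals: silence, a burst of "speech", then silence again.
ac = [0.0] * 8 + [1.0, -1.0] * 8 + [0.0] * 8
bc = [0.5 * x for x in ac]          # attenuated but concurrent BC signal
enrolled = [1.0, 0.0, 0.0]

genuine = airbone_authenticate(ac, bc, [0.9, 0.1, 0.0], enrolled)
replay = airbone_authenticate(ac, [0.0] * len(ac), [0.9, 0.1, 0.0], enrolled)
impostor = airbone_authenticate(ac, bc, [0.0, 1.0, 0.0], enrolled)
```

In this toy setup the genuine attempt passes both stages, the replay attempt fails stage 1 because the BC channel is silent, and the impostor attempt fails stage 2 because the BC embedding does not match the enrolled one. Running the cheap consistency check first means the speaker model is only consulted for utterances that already look like live speech from the wearer.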
