Recent advances in voice synthesis, coupled with the ease with which speech can be harvested from millions of people, introduce new threats to applications enabled by devices such as voice assistants (e.g., Amazon Alexa, Google Home). We explore whether a limited amount of unrelated speech from a target can be used to synthesize commands for a voice assistant like Amazon Alexa. More specifically, we investigate attacks with synthetic commands against voice assistants that match command sources to authorized users, where applications (e.g., Alexa Skills) process commands only when their source is an authorized user at a chosen confidence level. We demonstrate that even simple concatenative speech synthesis can be used by an attacker to command voice assistants to perform sensitive operations. We also show that such attacks, when launched by exploiting compromised devices in the vicinity of voice assistants, can have a relatively small host and network footprint. Our results demonstrate the need for better defenses against synthetic malicious commands targeting voice assistants.