Smart speaker voice assistants (VAs) such as Amazon Echo and Google Home have been widely adopted due to their seamless integration with smart home devices and Internet of Things (IoT) technologies. These VA services raise privacy concerns, especially because of their access to users' speech. This work considers one such concern: the unaccountable and unauthorized surveillance of a user's emotion via speech emotion recognition (SER). This paper presents DARE-GP, a solution that creates additive noise to mask users' emotional information while preserving the transcription-relevant portions of their speech. DARE-GP does this by using a constrained genetic programming approach to learn the spectral frequency traits that depict target users' emotional content, and then generating a universal adversarial audio perturbation that provides this privacy protection. Unlike existing works, DARE-GP a) protects previously unheard utterances in real time, b) works against previously unseen black-box SER classifiers, c) preserves speech transcription, and d) does so in a realistic acoustic environment. Further, this evasion is robust against defenses employed by a knowledgeable adversary. The evaluations in this work culminate with acoustic evaluations against two off-the-shelf commercial smart speakers, using a small-form-factor device (Raspberry Pi) integrated with a wake-word system to evaluate the efficacy of real-world, real-time deployment.
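To make the idea of a constrained evolutionary search for an additive perturbation more concrete, the following is a minimal toy sketch, not the DARE-GP implementation. It swaps in a plain genetic algorithm over tone parameters rather than genetic programming, and every name and constant in it (`EMOTION_HZ`, `SPEECH_HZ`, `SPEECH_CAP`, `fitness`, `evolve`) is a hypothetical stand-in: a real system would score candidates against SER classifiers and a transcription service instead of two fixed frequency bands.

```python
import math
import random

SAMPLE_RATE = 8000   # Hz, assumed for this toy example
N = 200              # samples per rendered perturbation (25 ms)
EMOTION_HZ = 200.0   # hypothetical band standing in for emotional cues
SPEECH_HZ = 1000.0   # hypothetical band standing in for transcription cues
SPEECH_CAP = 1e-4    # constraint: max perturbation energy allowed at SPEECH_HZ
MAX_AMP = 0.5        # upper bound on each tone's amplitude


def synth(tones):
    """Render a perturbation as a sum of sinusoids; tones = [(freq_hz, amp), ...]."""
    return [sum(a * math.sin(2 * math.pi * f * n / SAMPLE_RATE) for f, a in tones)
            for n in range(N)]


def band_energy(signal, freq_hz):
    """Normalized squared magnitude of a single DFT-style bin at freq_hz."""
    w = 2 * math.pi * freq_hz / SAMPLE_RATE
    re = sum(s * math.cos(w * n) for n, s in enumerate(signal))
    im = sum(s * math.sin(w * n) for n, s in enumerate(signal))
    return (re * re + im * im) / (N * N)


def fitness(tones):
    """Reward energy in the 'emotion' band; reject candidates that violate the cap."""
    noise = synth(tones)
    if band_energy(noise, SPEECH_HZ) > SPEECH_CAP:
        return float("-inf")  # infeasible: too much energy in the 'speech' band
    return band_energy(noise, EMOTION_HZ)


def mutate(tones, rng):
    """Jitter each tone's frequency and amplitude, clamped to the allowed ranges."""
    return [(min(2000.0, max(100.0, f + rng.gauss(0, 50))),
             min(MAX_AMP, max(0.0, a + rng.gauss(0, 0.05))))
            for f, a in tones]


def evolve(generations=25, pop_size=16, seed=0):
    rng = random.Random(seed)
    population = [[(rng.uniform(100, 2000), rng.uniform(0, MAX_AMP)) for _ in range(3)]
                  for _ in range(pop_size - 1)]
    # A silent candidate is always feasible, so the search never starts empty-handed.
    population.append([(EMOTION_HZ, 0.0)] * 3)
    history = []
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        survivors = population[: pop_size // 2]  # elitism: the best half is kept
        history.append(fitness(survivors[0]))
        children = [mutate(rng.choice(survivors), rng)
                    for _ in range(pop_size - len(survivors))]
        population = survivors + children
    return max(population, key=fitness), history


if __name__ == "__main__":
    best, history = evolve()
    print("best tones:", best)
    print("best fitness per generation:", history)
```

Because survivors carry over unchanged, the best fitness never decreases between generations, and the hard rejection in `fitness` keeps the selected perturbation within the stand-in transcription constraint.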