Speech is a common input method for mobile embedded devices, but cloud-based speech recognition exposes users to privacy risks. Disentanglement-based encoders protect user privacy by filtering sensitive information out of the speech signal, but they require substantial memory and computation, which limits their use on less powerful devices. To overcome this, we introduce XXX, a novel system optimized for such devices. XXX builds on the insight that speech understanding relies primarily on long-term dependencies across the entire utterance, whereas privacy-sensitive information is often tied to short-term details. XXX therefore selectively masks these short-term elements while preserving the long-term structure that speech understanding needs. At the core of XXX is a differential mask generator, grounded in interpretable learning, which fine-tunes the masking process. We evaluated XXX on the STM32H7 microcontroller under several potential attack scenarios. The results show that XXX achieves speech understanding accuracy and privacy protection comparable to existing encoders while being far more efficient, with up to 53.3$\times$ faster processing and a 134.1$\times$ smaller memory footprint.
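
To make the idea of masking short-term elements concrete, the following is a minimal sketch, not the method described above. It assumes the utterance is a plain list of samples split into fixed-length spans, and it replaces the learned mask generator with a fixed periodic pattern. The names `periodic_keep_flags` and `apply_span_mask`, and the parameters `span_len` and `period`, are hypothetical and chosen only for illustration.

```python
from typing import List


def periodic_keep_flags(num_spans: int, period: int) -> List[bool]:
    """Hypothetical stand-in for a learned mask: drop every `period`-th span."""
    if period < 2:
        raise ValueError("period must be at least 2 so that some spans survive")
    # Span i is dropped when it is the last one in its group of `period` spans.
    return [(i % period) != period - 1 for i in range(num_spans)]


def apply_span_mask(samples: List[float], span_len: int, keep: List[bool]) -> List[float]:
    """Zero out every span of `span_len` samples whose keep flag is False."""
    if span_len <= 0:
        raise ValueError("span_len must be positive")
    num_spans = -(-len(samples) // span_len)  # ceiling division
    if len(keep) != num_spans:
        raise ValueError("need exactly one keep flag per span")
    # Sample i belongs to span i // span_len; masked spans become silence.
    return [s if keep[i // span_len] else 0.0 for i, s in enumerate(samples)]


# Usage: 10 samples in spans of 2, with every third span masked (assumed values).
samples = [1.0] * 10
keep = periodic_keep_flags(num_spans=5, period=3)
masked = apply_span_mask(samples, span_len=2, keep=keep)
```

In this toy setup only the third span (samples 4 and 5) is silenced, so most of the utterance remains available to a downstream model. Which spans to mask, and how long they should be, is exactly what a learned mask generator would decide. The periodic rule here is only a placeholder for that decision.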