As technology advances and digital devices become prevalent, seamless human-machine communication is becoming increasingly important. The growing adoption of mobile, wearable, and other Internet of Things (IoT) devices has changed how we interact with them, making accurate spoken word recognition a crucial component of effective interaction. However, building a robust spoken word detection system that can handle novel keywords remains challenging, especially for low-resource languages with limited training data. Here, we propose PLiX, a multilingual, plug-and-play keyword spotting system that leverages few-shot learning to harness massive real-world data and enable the recognition of unseen spoken words at test time. Our few-shot deep models are trained on millions of one-second audio clips across 20 languages, achieving state-of-the-art performance while remaining highly efficient. Extensive evaluations show that PLiX generalizes to novel spoken words given as few as one support example and performs well on unseen languages out of the box. We release models and inference code to serve as a foundation for future research and for the development of voice-enabled user interfaces on emerging devices.
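The abstract does not spell out how a single support example is turned into a detector for a new keyword. As a rough illustration only, the sketch below shows one common few-shot recipe: average the embeddings of the support clips into a per-keyword prototype, then label a query clip by cosine similarity to the nearest prototype. This is not confirmed to be the PLiX method. The function names (`build_prototypes`, `classify`), the `threshold` value, and the assumption that some audio encoder has already mapped each one-second clip to a fixed-length vector are all hypothetical.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    x = np.asarray(x, dtype=np.float64)
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)


def build_prototypes(support_embeddings):
    """Map each keyword to the unit-length mean of its support embeddings.

    support_embeddings: dict of keyword -> array of shape (n_shots, dim).
    With one support example per keyword, n_shots is simply 1.
    """
    return {
        keyword: l2_normalize(l2_normalize(embs).mean(axis=0))
        for keyword, embs in support_embeddings.items()
    }


def classify(query_embedding, prototypes, threshold=0.5):
    """Return (keyword, similarity) for the closest prototype.

    The keyword is None when the best cosine similarity falls below
    `threshold`, i.e. the query matches none of the enrolled keywords.
    The default threshold is an arbitrary placeholder, not a tuned value.
    """
    query = l2_normalize(query_embedding)
    best_keyword, best_similarity = None, -1.0
    for keyword, prototype in prototypes.items():
        similarity = float(np.dot(query, prototype))
        if similarity > best_similarity:
            best_keyword, best_similarity = keyword, similarity
    if best_similarity < threshold:
        return None, best_similarity
    return best_keyword, best_similarity


if __name__ == "__main__":
    # Synthetic stand-ins for encoder outputs; a real system would obtain
    # these vectors from an audio embedding model.
    rng = np.random.default_rng(0)
    dim = 16
    basis = np.eye(dim)
    support = {
        "hello": (basis[0] + 0.05 * rng.normal(size=dim))[None, :],
        "stop": (basis[1] + 0.05 * rng.normal(size=dim))[None, :],
    }
    prototypes = build_prototypes(support)
    query = basis[1] + 0.05 * rng.normal(size=dim)
    print(classify(query, prototypes))
```

Enrolling a new keyword in this scheme only requires adding one more entry to the prototype dictionary, with no retraining, which is one way to read "plug-and-play" at test time.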