Spoken keyword spotting (KWS) is the task of identifying a keyword in an audio stream. It is widely used in smart devices at the edge to activate voice assistants and perform hands-free tasks. The task is challenging because such systems must achieve high accuracy while running efficiently on low-power devices with possibly limited computational capabilities. This work presents AraSpot, an Arabic keyword spotting system trained on 40 Arabic keywords that uses several online data augmentation techniques and introduces the ConformerGRU model architecture. We further improve the model's performance by training a text-to-speech model to generate synthetic data. AraSpot achieves a state-of-the-art (SOTA) result of 99.59%, outperforming previous approaches.
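As a rough illustration of what "online" data augmentation means in this setting, the sketch below perturbs a waveform at random each time it is drawn, so the model rarely sees the exact same input twice. The abstract does not list the specific augmentations used, so the random time shift and additive noise here, along with the function name `augment_online` and its parameters, are assumptions chosen for illustration and not the authors' pipeline.

```python
import numpy as np


def augment_online(waveform, rng, max_shift=1600, snr_db_range=(10.0, 30.0)):
    """Return a randomly perturbed float32 copy of a 1-D waveform.

    Hypothetical helper: the transforms and default values are
    illustrative assumptions, not taken from the AraSpot paper.
    """
    # Random circular time shift of up to max_shift samples either way.
    shift = int(rng.integers(-max_shift, max_shift + 1))
    out = np.roll(waveform, shift).astype(np.float64)

    # Additive Gaussian noise at a randomly drawn signal-to-noise ratio.
    snr_db = rng.uniform(*snr_db_range)
    signal_power = float(np.mean(out ** 2)) + 1e-12  # avoid division by zero
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=out.shape)

    return (out + noise).astype(np.float32)


# Usage: a synthetic one-second tone at 16 kHz stands in for a keyword clip.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
clip = np.sin(2.0 * np.pi * 440.0 * t).astype(np.float32)

# Each call yields a different variant of the same clip.
variant_a = augment_online(clip, rng)
variant_b = augment_online(clip, rng)
```

Because the perturbation is drawn at load time and not precomputed, the augmented data never needs to be stored on disk, and every training epoch sees fresh variants.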