Self-supervised learning methods that provide generalized speech representations have recently received increasing attention. Wav2vec 2.0 is the most famous example, showing remarkable performance in numerous downstream speech processing tasks. Despite its success, it is challenging to use directly for wake-up word detection on mobile devices because of its high computational cost. In this work, we propose LiteFEW, a lightweight feature encoder for wake-up word detection that preserves the inherent ability of wav2vec 2.0 at a minimal scale. In the proposed method, the knowledge of the pre-trained wav2vec 2.0 is compressed with an auto-encoder-based dimensionality reduction technique and then distilled into LiteFEW. Experimental results on the open-source "Hey Snips" dataset show that the proposed method, applied to various model structures, significantly improves performance, achieving a relative improvement of over 20% with only 64k parameters.
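The compress-then-distill idea can be illustrated with a minimal sketch. This is not the authors' implementation: the linear encoder and decoder, the mean-squared-error objectives, the dimensions `TEACHER_DIM` and `CODE_DIM`, and all function names are assumptions chosen for illustration, and the weights are random rather than trained. It only shows the two quantities such a scheme might minimize: a reconstruction loss for the auto-encoder applied to teacher features, and a distillation loss that pulls the student's output toward the compressed teacher features.

```python
import random

TEACHER_DIM = 8   # hypothetical; stands in for the teacher's feature width
CODE_DIM = 2      # hypothetical bottleneck width that the student must match

def matvec(weights, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weights]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def random_matrix(rows, cols, rng):
    """Small random weights; a real system would learn these."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

def reconstruction_loss(enc, dec, teacher_frame):
    """Auto-encoder objective: compress the teacher frame, then rebuild it."""
    code = matvec(enc, teacher_frame)
    return mse(matvec(dec, code), teacher_frame)

def distillation_loss(enc, teacher_frame, student_frame):
    """Student objective (assumed MSE): match the compressed teacher frame."""
    return mse(student_frame, matvec(enc, teacher_frame))

rng = random.Random(0)
enc = random_matrix(CODE_DIM, TEACHER_DIM, rng)   # linear encoder, untrained
dec = random_matrix(TEACHER_DIM, CODE_DIM, rng)   # linear decoder, untrained
teacher_frame = [rng.gauss(0.0, 1.0) for _ in range(TEACHER_DIM)]
student_frame = [0.0] * CODE_DIM                  # placeholder student output

print(reconstruction_loss(enc, dec, teacher_frame))
print(distillation_loss(enc, teacher_frame, student_frame))
```

The point of the bottleneck in this sketch is that the student only has to produce `CODE_DIM` values per frame instead of `TEACHER_DIM`, which is what would allow a much smaller student network to imitate the teacher.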