In this paper, we propose self-supervised pretraining on a large unlabelled data set to improve the performance of a personalized voice activity detection (VAD) model in adverse conditions. We pretrain a long short-term memory (LSTM) encoder using the autoregressive predictive coding (APC) framework and fine-tune it for personalized VAD. We also propose a denoising variant of APC, with the goal of improving the robustness of personalized VAD. The trained models are systematically evaluated on both clean speech and speech contaminated by various types of noise at different signal-to-noise ratio (SNR) levels, and are compared to a purely supervised model. Our experiments show that self-supervised pretraining not only improves performance in clean conditions, but also yields models that are more robust to adverse conditions than purely supervised learning.
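To make the pretraining objective concrete, the following is a minimal sketch of an APC-style loss, not the implementation used in this work. APC trains a causal model to predict a feature frame a fixed number of steps ahead and scores the prediction with an L1 distance. The sketch assumes that the denoising variant feeds noisy frames to the model while keeping clean frames as targets; the abstract does not spell this out. The names `apc_l1_loss` and `last_frame_predictor`, the toy feature values, and the one-frame shift are all hypothetical, and the trivial predictor stands in for the LSTM encoder.

```python
from typing import Callable, List, Sequence

Frame = List[float]


def apc_l1_loss(
    inputs: Sequence[Frame],
    targets: Sequence[Frame],
    predict: Callable[[Sequence[Frame]], Frame],
    shift: int,
) -> float:
    """Mean L1 distance between predict(inputs[0..t]) and targets[t + shift]."""
    if len(inputs) != len(targets):
        raise ValueError("inputs and targets must have the same number of frames")
    if shift < 1 or shift >= len(inputs):
        raise ValueError("shift must satisfy 1 <= shift < number of frames")

    total = 0.0
    count = 0
    for t in range(len(inputs) - shift):
        # Causal prediction: the model only sees frames 0..t.
        prediction = predict(inputs[: t + 1])
        target = targets[t + shift]
        total += sum(abs(p - x) for p, x in zip(prediction, target))
        count += len(target)
    return total / count


def last_frame_predictor(history: Sequence[Frame]) -> Frame:
    """Toy stand-in for the LSTM encoder (assumption): repeat the latest frame."""
    return list(history[-1])


# Hypothetical two-dimensional feature frames, for illustration only.
clean = [[0.0, 1.0], [1.0, 2.0], [2.0, 3.0], [3.0, 4.0]]
noisy = [[0.5, 1.0], [1.0, 2.5], [2.0, 3.0], [3.5, 4.0]]

# Standard APC: clean frames are both the input and the prediction target.
standard_loss = apc_l1_loss(clean, clean, last_frame_predictor, shift=1)

# Assumed denoising variant: noisy input, clean future frames as targets.
denoising_loss = apc_l1_loss(noisy, clean, last_frame_predictor, shift=1)
```

In this reading, the only difference between the two objectives is which signal is passed as `inputs`, so the same encoder and loss can serve both pretraining setups.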