In this paper, we propose self-supervised pretraining on a large unlabelled data set to improve the performance of a personalized voice activity detection (VAD) model in adverse conditions. We pretrain a long short-term memory (LSTM) encoder using the autoregressive predictive coding (APC) framework and fine-tune it for personalized VAD. We also propose a denoising variant of APC, with the goal of improving the robustness of personalized VAD. The trained models are systematically evaluated on both clean speech and speech contaminated by various types of noise at different signal-to-noise ratio (SNR) levels, and are compared to a purely supervised model. Our experiments show that self-supervised pretraining not only improves performance in clean conditions, but also yields models that are more robust to adverse conditions than those obtained with purely supervised learning.