Target-Speaker Voice Activity Detection (TS-VAD) is the task of detecting the presence of speech from a known target speaker in an audio frame. Recently, deep neural network-based models have shown strong performance on this task. However, training these models requires extensive labelled data, which is costly and time-consuming to obtain, particularly when generalization to unseen environments is crucial. To mitigate this, we propose a causal Self-Supervised Learning (SSL) pretraining framework, called Denoising Autoregressive Predictive Coding (DN-APC), to enhance TS-VAD performance in noisy conditions. We also explore various speaker conditioning methods and evaluate their performance under different noisy conditions. Our experiments show that DN-APC improves performance in noisy conditions, with a general improvement of approximately 2% in both seen and unseen noise. Additionally, we find that FiLM conditioning provides the best overall performance. Representation analysis via t-SNE plots reveals robust initial representations of speech and non-speech learned during pretraining. These results underscore the effectiveness of SSL pretraining in improving the robustness and performance of TS-VAD models in noisy environments.