Voice activity detection (VAD) improves the performance of speaker verification (SV) by preserving speech segments and attenuating the effects of non-speech. However, this scheme is not ideal: (1) it fails in noisy environments or multi-speaker conversations; (2) it is trained on inaccurate labels that are not sensitive to SV. To address these problems, we propose a speaker verification-based voice activity detection (SVVAD) framework that adapts to the speech features that are most informative for SV. To train it, we introduce a label-free method with triplet-like losses that completely avoids the SV performance degradation caused by incorrect labeling. Extensive experiments show that SVVAD significantly outperforms the baseline in terms of equal error rate (EER) under conditions where other speakers are mixed in at different ratios. Moreover, the decision boundaries reveal the importance of the different parts of speech, which are largely consistent with human judgments.
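As a rough illustration of what a triplet-style objective looks like, the sketch below implements the generic triplet margin loss on speaker embeddings. It is not the SVVAD loss itself, whose exact form the abstract does not give; the cosine distance, the `margin` value of 0.2, and the function names are all assumptions made for this example.

```python
import numpy as np

def cosine_distance(a, b):
    # 1 - cosine similarity; 0 for identical directions, 1 for orthogonal ones.
    a = a / np.linalg.norm(a)
    b = b / np.linalg.norm(b)
    return 1.0 - float(np.dot(a, b))

def triplet_margin_loss(anchor, positive, negative, margin=0.2):
    # Generic triplet loss (assumed form, not taken from the paper):
    # penalize the anchor unless it is closer to the positive than to the
    # negative by at least `margin`.
    d_pos = cosine_distance(anchor, positive)
    d_neg = cosine_distance(anchor, negative)
    return max(0.0, d_pos - d_neg + margin)

# Toy 2-D "embeddings": the positive matches the anchor, the negative is orthogonal.
anchor = np.array([1.0, 0.0])
positive = np.array([1.0, 0.0])
negative = np.array([0.0, 1.0])

easy = triplet_margin_loss(anchor, positive, negative)  # well-separated triplet
hard = triplet_margin_loss(anchor, negative, positive)  # roles swapped, so violated
print(easy, hard)
```

A loss of this kind needs only the grouping of utterances by speaker, not frame-level speech/non-speech annotations, which is the sense in which such training can be label-free with respect to VAD.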