The success of automatic speaker verification shows that discriminative speaker representations can be extracted from neutral speech. Laughter, a non-verbal vocalization, should intuitively carry speaker information as well. This paper therefore explores speaker verification on utterances that contain non-verbal laughter segments. We collect clips with laughter components by running a laughter detection script on VoxCeleb and part of the CN-Celeb dataset. To filter out untrusted clips, we compute probability scores with a binary laughter detection classifier pre-trained on pure laughter and neutral speech. From the clips whose scores exceed the threshold, we construct trials for two evaluation scenarios: Laughter-Laughter (LL) and Speech-Laughter (SL). We then propose a novel method, the Laughter-Splicing based Network (LSN), which significantly boosts performance in both scenarios while maintaining performance on neutral speech such as the VoxCeleb1 test set. Specifically, our system achieves relative improvements of 20% and 22% on the Laughter-Laughter and Speech-Laughter trials, respectively. The metadata and sample clips have been released at https://github.com/nevermoreLin/Laugh_LSN.
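The filtering and trial-construction steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' released code: the `Clip` record, the function names, the 0.5 threshold, and the exhaustive pairing strategy are all assumptions made for illustration, since the abstract specifies none of them.

```python
from dataclasses import dataclass
from itertools import combinations, product


@dataclass(frozen=True)
class Clip:
    """Hypothetical clip record; field names are illustrative only."""
    speaker: str   # speaker identity label
    kind: str      # "laughter" or "speech"
    score: float   # laughter-classifier probability (used for laughter clips)


def filter_laughter(clips, threshold=0.5):
    """Keep laughter clips whose classifier score exceeds the threshold.

    The threshold value is an assumption; the abstract does not state it.
    """
    return [c for c in clips if c.kind == "laughter" and c.score > threshold]


def build_trials(clips, threshold=0.5):
    """Build Laughter-Laughter (LL) and Speech-Laughter (SL) trial lists.

    Each trial is (clip_a, clip_b, is_target), where is_target is True
    when both clips come from the same speaker.
    """
    laughs = filter_laughter(clips, threshold)
    speech = [c for c in clips if c.kind == "speech"]
    # LL: every unordered pair of trusted laughter clips.
    ll = [(a, b, a.speaker == b.speaker) for a, b in combinations(laughs, 2)]
    # SL: every speech clip paired with every trusted laughter clip.
    sl = [(s, l, s.speaker == l.speaker) for s, l in product(speech, laughs)]
    return ll, sl


# Toy data: four laughter clips (one below the threshold) and two speech clips.
clips = [
    Clip("spk1", "laughter", 0.9),
    Clip("spk1", "laughter", 0.8),
    Clip("spk2", "laughter", 0.7),
    Clip("spk2", "laughter", 0.3),  # filtered out as untrusted
    Clip("spk1", "speech", 0.0),
    Clip("spk2", "speech", 0.0),
]
ll, sl = build_trials(clips)
```

With this toy data, three laughter clips survive filtering, giving three LL trials and six SL trials. A real evaluation set would likely subsample non-target pairs rather than enumerate every combination.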