It is widely acknowledged that discriminative representations for speaker verification can be extracted from verbal speech. However, how much speaker information non-verbal vocalizations carry remains an open question. This paper explores speaker verification based on the most ubiquitous form of non-verbal vocalization, laughter. First, we use a semi-automatic pipeline to collect a new dataset, Haha-Pod, from open-source podcast media. The dataset contains laughter clips from over 240 speakers, paired with corresponding high-quality verbal speech. Second, we propose a Two-Stage Teacher-Student (2S-TS) framework to minimize the within-speaker embedding distance between verbal and non-verbal (laughter) signals. Using Haha-Pod as a test set, we design two trials (S2L-Eval) to verify a speaker's identity from laughter. Experimental results demonstrate that our method significantly improves performance on the S2L-Eval trials with only a minor degradation on the VoxCeleb1 test set. The resources for the Haha-Pod dataset can be found at https://github.com/nevermoreLin/HahaPod.
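As a rough illustration of what a within-speaker embedding distance between verbal and laughter signals could look like, the sketch below scores paired embeddings with cosine similarity. The abstract does not specify the distance measure, the loss, or the decision threshold, so the choice of cosine distance, the function names, and the default threshold of 0.5 are all assumptions made for illustration and not the 2S-TS implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (plain lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def within_speaker_distance(speech_embs, laugh_embs):
    """Mean cosine distance over paired embeddings.

    Assumption: speech_embs[i] and laugh_embs[i] come from the same speaker.
    A training objective of this kind would push the value toward 0.
    """
    distances = [
        1.0 - cosine_similarity(s, l) for s, l in zip(speech_embs, laugh_embs)
    ]
    return sum(distances) / len(distances)


def verify(enroll_emb, test_emb, threshold=0.5):
    """Accept the trial if the similarity reaches the (hypothetical) threshold."""
    return cosine_similarity(enroll_emb, test_emb) >= threshold
```

In a speech-to-laughter trial of this kind, `enroll_emb` would be derived from a speaker's verbal speech and `test_emb` from a laughter clip; in practice the threshold would be tuned on held-out trials.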