It is widely acknowledged that discriminative representations for speaker verification can be extracted from verbal speech. However, how much speaker information non-verbal vocalization carries remains an open question. This paper explores speaker verification based on the most ubiquitous form of non-verbal voice: laughter. First, we use a semi-automatic pipeline to collect a new dataset, Haha-Pod, from open-source podcast media. The dataset contains laughter clips from over 240 speakers, each paired with corresponding high-quality verbal speech. Second, we propose a Two-Stage Teacher-Student (2S-TS) framework that minimizes the within-speaker embedding distance between verbal and non-verbal (laughter) signals. Using Haha-Pod as a test set, we design two trial sets (S2L-Eval) to verify a speaker's identity from laughter. Experimental results demonstrate that our method significantly improves performance on the S2L-Eval test set with only a minor degradation on the VoxCeleb1 test set. The Haha-Pod dataset is publicly available at https://drive.google.com/file/d/1J-HBRTsm_yWrcbkXupy-tiWRt5gE2LzG/view?usp=drive_link.
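As a rough illustration of what "within-speaker embedding distance" between verbal and laughter signals could mean in practice, the sketch below computes a mean cosine distance over paired embeddings and a cosine score for a single speech-to-laughter trial. The abstract does not specify the distance measure, the embedding model, or the scoring back-end, so cosine distance and the function names `within_speaker_cosine_distance` and `score_trial` are assumptions made for this example, not the 2S-TS objective itself.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    norm = np.linalg.norm(x, axis=-1, keepdims=True)
    return x / np.maximum(norm, eps)


def within_speaker_cosine_distance(speech_emb, laugh_emb):
    """Mean (1 - cosine similarity) over paired rows.

    Row i of each array is assumed to come from the same speaker:
    speech_emb[i] from verbal speech, laugh_emb[i] from laughter.
    Cosine distance is an assumption; the abstract only says "distance".
    """
    s = l2_normalize(np.asarray(speech_emb, dtype=float))
    l = l2_normalize(np.asarray(laugh_emb, dtype=float))
    return float(np.mean(1.0 - np.sum(s * l, axis=-1)))


def score_trial(enroll_speech_emb, test_laugh_emb):
    """Cosine score for one hypothetical speech-to-laughter trial."""
    e = l2_normalize(np.asarray(enroll_speech_emb, dtype=float))
    t = l2_normalize(np.asarray(test_laugh_emb, dtype=float))
    return float(np.dot(e, t))
```

In a teacher-student setup of this kind, the speech embeddings would typically come from a frozen teacher and the laughter embeddings from the student being trained, with the distance above used as a loss term; whether 2S-TS is arranged this way is not stated in the abstract.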