Speaker individuality information is among the most critical elements of speech signals. Modeled thoroughly and accurately, it can be used in a range of intelligent speech applications, such as speaker recognition, speaker diarization, speech synthesis, and target speaker extraction. In this overview, we present a comprehensive review of neural approaches to speaker representation learning from both theoretical and practical perspectives. Theoretically, we discuss speaker encoders along several axes: from supervised to self-supervised learning algorithms, from standalone models to large pretrained models, and from pure speaker embedding learning to joint optimization with downstream tasks; we also cover efforts toward interpretability. Practically, we systematically examine approaches for robustness and effectiveness, and we introduce and compare various open-source toolkits in the field. Through this systematic review of the relevant literature, research activities, and resources, we provide a clear reference for researchers in speaker characterization and modeling, as well as for those who wish to apply speaker modeling techniques to specific downstream tasks.