Wav2vec2 has successfully applied the Transformer architecture and self-supervised learning to speech recognition. Recently, such models have come to be used not only for speech recognition but across speech processing as a whole. This paper introduces an effective end-to-end speaker identification model built on a Transformer-based contextual model. We explore the relationship between the model's parameters and its performance in order to identify the structure of an effective model. Furthermore, we propose a pooling method, Temporal Gate Pooling, with strong learning ability for speaker identification. We use Conformer as the encoder and BEST-RQ for pre-training, and evaluate on the VoxCeleb1 speaker identification task. The proposed method achieves an accuracy of 85.9% with 28.5M parameters, comparable to that of wav2vec2 with 317.7M parameters. Code is available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.
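To make the role of a pooling layer concrete, the following is a minimal sketch of a generic gated pooling over time, which collapses a variable-length sequence of encoder frames into one fixed-size vector for a speaker classifier. The abstract does not define Temporal Gate Pooling, so this is not the authors' formulation: the function name `gated_temporal_pool`, the scalar sigmoid gate per frame, and the parameters `gate_weights` and `gate_bias` are all assumptions chosen for illustration.

```python
import math


def sigmoid(x: float) -> float:
    """Logistic function, mapping a real number into (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def gated_temporal_pool(frames, gate_weights, gate_bias=0.0):
    """Pool a (T, D) list of frame features into a single D-dim vector.

    Hypothetical illustration, not the paper's definition: each frame
    receives a scalar gate from a linear projection followed by a
    sigmoid, and the output is the gate-weighted mean over time.
    """
    if not frames:
        raise ValueError("need at least one frame")
    dim = len(frames[0])
    # One gate per frame: sigmoid(w . x_t + b).
    gates = [
        sigmoid(sum(w * x for w, x in zip(gate_weights, frame)) + gate_bias)
        for frame in frames
    ]
    total = sum(gates)
    # Weighted average over the time axis, one value per feature dimension.
    return [
        sum(g * frame[d] for g, frame in zip(gates, frames)) / total
        for d in range(dim)
    ]


# With all-zero gate weights every gate is 0.5, so the result
# reduces to plain mean pooling over the two frames.
print(gated_temporal_pool([[1.0, 2.0], [3.0, 4.0]], [0.0, 0.0]))  # [2.0, 3.0]
```

In a trained model the gate parameters would be learned, letting the layer emphasize frames that carry more speaker information instead of weighting every frame equally as mean pooling does.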