Wav2vec2 has succeeded in applying the Transformer architecture and self-supervised learning to speech recognition. More recently, these techniques have come to be used not only for speech recognition but across speech processing as a whole. This paper introduces an effective end-to-end speaker identification model built on a Transformer-based contextual model. We explore the relationship between hyper-parameters and performance in order to identify the structure of an effective model. Furthermore, we propose a pooling method, Temporal Gate Pooling, with powerful learning ability for speaker identification. We use Conformer as the encoder and BEST-RQ for pre-training, and evaluate the model on the VoxCeleb1 speaker identification task. The proposed method achieves an accuracy of 87.1% with 28.5M parameters, which is comparable to the accuracy of wav2vec2 with 317.7M parameters. Code is available at https://github.com/HarunoriKawano/speaker-identification-with-tgp.