Speaker identification in multilingual settings presents unique challenges, particularly when conventional models are trained predominantly on English data. In this paper, we propose WSI (Whisper Speaker Identification), a framework that repurposes the encoder of the Whisper automatic speech recognition model, pre-trained on extensive multilingual data, to generate robust speaker embeddings via a joint loss optimization strategy that combines online hard triplet mining with a self-supervised Normalized Temperature-scaled Cross-Entropy (NT-Xent) loss. By capitalizing on Whisper's language-agnostic acoustic representations, our approach effectively distinguishes speakers across diverse languages and recording conditions. Extensive evaluations on multiple corpora, including VoxTube (multilingual), JVS (Japanese), CallHome (German, Spanish, Chinese, and Japanese), and VoxConverse (English), demonstrate that WSI consistently outperforms state-of-the-art baselines, namely Pyannote Embedding, ECAPA-TDNN, and x-vector, achieving lower equal error rates and higher AUC scores. These results validate our hypothesis that a multilingual pre-trained ASR encoder, combined with joint loss optimization, substantially improves speaker identification performance in non-English languages.
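The two components of the joint loss named above can be sketched in a few lines. The following is a minimal NumPy illustration of batch-hard triplet mining and the NT-Xent objective, not the paper's implementation; the margin, temperature, and batch construction are illustrative assumptions.

```python
import numpy as np

def batch_hard_triplet_loss(emb, labels, margin=0.3):
    """Online hard triplet mining: for each anchor, use the farthest
    same-speaker embedding and the closest different-speaker embedding.
    margin is an assumed hyperparameter."""
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    d = np.linalg.norm(emb[:, None] - emb[None, :], axis=2)  # pairwise distances
    n, loss = len(labels), 0.0
    for i in range(n):
        pos = (labels == labels[i]) & (np.arange(n) != i)
        neg = labels != labels[i]
        if not pos.any() or not neg.any():
            continue
        hardest_pos = d[i][pos].max()  # farthest positive
        hardest_neg = d[i][neg].min()  # closest negative
        loss += max(hardest_pos - hardest_neg + margin, 0.0)
    return loss / n

def nt_xent_loss(z1, z2, temperature=0.1):
    """NT-Xent over two views: z1[i] and z2[i] form the positive pair;
    all other non-self embeddings in the batch act as negatives."""
    z = np.concatenate([z1, z2], axis=0)
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    sim = z @ z.T / temperature  # scaled cosine similarities
    n, loss = len(z1), 0.0
    for i in range(2 * n):
        j = (i + n) % (2 * n)            # index of the positive partner
        pos = sim[i, j]
        logits = np.delete(sim[i], i)    # exclude self-similarity
        loss += -(pos - np.log(np.exp(logits).sum()))  # -log softmax
    return loss / (2 * n)

# A joint objective would then be a weighted sum, e.g.
# L = alpha * batch_hard_triplet_loss(...) + beta * nt_xent_loss(...),
# where alpha and beta are assumed weighting coefficients.
```

In a joint setup like this, the triplet term enforces speaker-discriminative margins while the NT-Xent term pulls together augmented views of the same utterance, which is one plausible reading of the combination the abstract describes.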