Trained on 680,000 hours of speech data, Whisper is a multitask, multilingual speech foundation model that demonstrates superior performance in automatic speech recognition, translation, and language identification. However, its applicability to speaker verification (SV) remains unexplored, particularly in low-data-resource scenarios where labeled speaker data for specific domains are limited. To fill this gap, we propose Whisper-SV, a lightweight adaptor framework that boosts SV with Whisper. Since Whisper is not specifically optimized for SV, we introduce a representation selection module that quantifies the speaker-specific characteristics contained in each layer of Whisper and selects the top-k layers with the most discriminative speaker features. To aggregate pivotal speaker-related features while suppressing non-speaker redundancies across the selected top-k layers, we design a multi-layer aggregation module in Whisper-SV that integrates the multi-layer representations into a single, compact representation for SV. This module employs convolutional layers with shortcut connections across layers to refine the speaker characteristics derived from Whisper's multi-layer representations, and an attention aggregation layer to reduce non-speaker interference and amplify speaker-specific cues. Finally, a simple classification module performs speaker classification. Experiments on the VoxCeleb1, FFSVC, and IMSV datasets demonstrate that Whisper-SV achieves EER/minDCF of 2.22%/0.307, 6.14%/0.488, and 7.50%/0.582, respectively, showing superior performance in low-data-resource SV scenarios.
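The layer-selection and aggregation pipeline described above can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the `layer_scores` input stands in for whatever per-layer speaker-discriminability measure the selection module computes, and the softmax attention over layers is a simplified stand-in for the convolutional and attention aggregation modules.

```python
import numpy as np

def softmax(x):
    # numerically stable softmax over a 1-D score vector
    e = np.exp(x - x.max())
    return e / e.sum()

def select_and_aggregate(layer_reps, layer_scores, k):
    """Sketch of top-k layer selection + multi-layer aggregation.

    layer_reps:   (L, T, D) hidden states from all L Whisper layers
    layer_scores: (L,) hypothetical speaker-discriminability score per
                  layer (a placeholder for the representation selection
                  module's quantification)
    Returns a single (D,) utterance-level speaker embedding.
    """
    top_k = np.argsort(layer_scores)[-k:]            # indices of top-k layers
    selected = layer_reps[top_k]                     # (k, T, D)
    weights = softmax(layer_scores[top_k])           # attention over layers
    fused = np.tensordot(weights, selected, axes=1)  # weighted sum -> (T, D)
    return fused.mean(axis=0)                        # temporal mean pooling

# toy usage: 12 layers, 50 frames, feature dim 8
rng = np.random.default_rng(0)
reps = rng.normal(size=(12, 50, 8))
scores = rng.normal(size=12)
emb = select_and_aggregate(reps, scores, k=4)
print(emb.shape)  # (8,)
```

In the actual framework the fused representation would feed the classification module for speaker classification; here the fusion is reduced to a single attention-weighted sum to keep the sketch self-contained.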