Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce DELULU, a speaker-aware self-supervised foundational model that addresses this limitation by integrating external supervision into the pseudo-label generation process. DELULU leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide the k-means clustering step during pre-training, introducing a strong speaker-discriminative inductive bias that aligns representation learning with speaker identity. The model is trained using a dual objective that combines masked prediction and denoising, further enhancing robustness and generalization. DELULU significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to 62% relative improvement in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks such as gender, age, accent, and speaker counting. Our findings demonstrate that DELULU is a strong universal encoder for speaker-aware speech processing, enabling superior performance even without task-specific fine-tuning.
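The speaker-aware pseudo-labeling step described above can be sketched as follows. This is a minimal illustrative sketch, not DELULU's actual implementation: `frame_embeddings` stands in for frame-level ReDimNet outputs (here faked with random vectors), and the cluster count is an arbitrary choice.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical stand-in for frame-level ReDimNet speaker embeddings:
# (num_frames, embed_dim), filled with random values for illustration.
rng = np.random.default_rng(0)
frame_embeddings = rng.normal(size=(1000, 192))

# Cluster the speaker-discriminative embeddings; the resulting cluster
# indices serve as pseudo-labels (prediction targets) for the masked
# prediction objective during pre-training. 100 clusters is arbitrary.
kmeans = KMeans(n_clusters=100, n_init=10, random_state=0)
pseudo_labels = kmeans.fit_predict(frame_embeddings)

# One pseudo-label per frame, each in [0, n_clusters).
print(pseudo_labels.shape)
```

Because the cluster assignments are derived from speaker-discriminative embeddings rather than acoustic features alone, frames from the same speaker tend to share pseudo-labels, which is the inductive bias the abstract describes.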