Self-supervised speech models have achieved remarkable success on content-driven tasks, yet they remain limited in capturing the speaker-discriminative features critical for verification, diarization, and profiling applications. We introduce \textsc{DELULU}, a speaker-aware self-trained foundational model that addresses this limitation by incorporating speaker-informed structure into pseudo-label generation. \textsc{DELULU} leverages frame-level embeddings from ReDimNet, a state-of-the-art speaker verification model, to guide k-means clustering during pre-training, introducing a speaker-discriminative inductive bias that aligns representation learning with speaker identity. \textsc{DELULU} significantly outperforms prior self-supervised learning (SSL) models across a range of speaker-centric tasks, achieving up to \textbf{62\% relative improvement} in equal error rate (EER) for speaker verification and consistent gains on zero-shot profiling tasks, including gender, age, and accent prediction and speaker counting. Notably, it surpasses even its teacher model on zero-shot evaluations. Our findings demonstrate that \textbf{\textsc{DELULU} is a strong universal encoder for speaker-aware speech processing}, enabling superior performance without task-specific fine-tuning.
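The pseudo-label step described in the abstract (k-means over frame-level teacher embeddings) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration rather than the authors' implementation: the 2-D random points merely stand in for ReDimNet frame-level embeddings, the names `kmeans_pseudo_labels` and `assign_labels` are hypothetical, and the cluster count and iteration budget are arbitrary, since the abstract does not specify them.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def assign_labels(frames, centroids):
    """Index of the nearest centroid for every frame embedding."""
    return [
        min(range(len(centroids)), key=lambda k: squared_distance(f, centroids[k]))
        for f in frames
    ]


def kmeans_pseudo_labels(frames, num_clusters, num_iters=20, seed=0):
    """Plain Lloyd's k-means; the cluster index of each frame is its pseudo-label."""
    rng = random.Random(seed)
    # Initialise centroids from randomly chosen frames.
    centroids = [list(f) for f in rng.sample(frames, num_clusters)]
    for _ in range(num_iters):
        labels = assign_labels(frames, centroids)
        for k in range(num_clusters):
            members = [f for f, lab in zip(frames, labels) if lab == k]
            if members:  # keep the previous centroid if a cluster is empty
                dim = len(members[0])
                centroids[k] = [
                    sum(m[d] for m in members) / len(members) for d in range(dim)
                ]
    return assign_labels(frames, centroids), centroids


# Toy stand-in for teacher embeddings: two well-separated synthetic "speakers".
rng = random.Random(1)
speaker_a = [[rng.gauss(0.0, 0.1), rng.gauss(0.0, 0.1)] for _ in range(50)]
speaker_b = [[rng.gauss(5.0, 0.1), rng.gauss(5.0, 0.1)] for _ in range(50)]
frames = speaker_a + speaker_b

labels, centroids = kmeans_pseudo_labels(frames, num_clusters=2)
```

In this toy setting, frames from the same synthetic speaker end up with the same cluster index, which is the kind of speaker-aligned structure the abstract attributes to clustering in a speaker-verification embedding space. A real pipeline would cluster high-dimensional embeddings from the actual teacher model with a far larger codebook.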