The choice of objective function is crucial for learning high-quality representations with self-supervised learning. This paper investigates how different formulations of the Barlow Twins (BT) objective affect downstream task performance on speech data. We propose Modified Barlow Twins (MBT), which normalizes the latents to enforce scale invariance, and evaluate it on speaker identification, gender recognition, and keyword spotting. Our results show that MBT improves representation generalization over the original BT, especially when fine-tuning with limited target data. This highlights the importance of designing objectives that encourage invariant and transferable representations. Our analysis provides insight into how the BT learning objective can be tailored to produce speech representations that perform well when adapted to new downstream tasks. This study is an important step towards developing reusable self-supervised speech representations.
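To make the objective concrete, the following is a minimal NumPy sketch of the standard Barlow Twins loss with an optional normalization step. The abstract does not specify how MBT normalizes the latents, so the per-sample L2 normalization used here is only one plausible reading, not the authors' formulation. The function name `barlow_twins_loss`, the flag `l2_normalize`, and the default off-diagonal weight `lambda_offdiag` are all illustrative assumptions.

```python
import numpy as np


def barlow_twins_loss(z_a, z_b, lambda_offdiag=5e-3, l2_normalize=False, eps=1e-8):
    """Barlow Twins loss for two batches of latents of shape (batch, dim).

    With l2_normalize=True, each latent vector is first scaled to unit length.
    This is an ASSUMED stand-in for the "normalized latents" of MBT; the
    source text does not give the exact formulation.
    """
    z_a = np.asarray(z_a, dtype=np.float64)
    z_b = np.asarray(z_b, dtype=np.float64)

    if l2_normalize:
        # Assumption: per-sample L2 normalization removes the scale of each latent.
        z_a = z_a / (np.linalg.norm(z_a, axis=1, keepdims=True) + eps)
        z_b = z_b / (np.linalg.norm(z_b, axis=1, keepdims=True) + eps)

    # Standardize every latent dimension across the batch (as in standard BT).
    z_a = (z_a - z_a.mean(axis=0)) / (z_a.std(axis=0) + eps)
    z_b = (z_b - z_b.mean(axis=0)) / (z_b.std(axis=0) + eps)

    # Cross-correlation matrix between the two views, shape (dim, dim).
    n = z_a.shape[0]
    c = z_a.T @ z_b / n

    # Invariance term: pull diagonal entries towards 1.
    diag = np.diag(c)
    on_diag = np.sum((1.0 - diag) ** 2)

    # Redundancy-reduction term: push off-diagonal entries towards 0.
    off_diag = np.sum(c ** 2) - np.sum(diag ** 2)

    return on_diag + lambda_offdiag * off_diag
```

Under this assumption, calling `barlow_twins_loss(z_a, z_b, l2_normalize=True)` gives essentially the same value if every latent vector is multiplied by its own positive scale factor, which is one way to read "scale invariance"; the unnormalized loss does not have this property in general.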