Speech foundation models, pre-trained on large corpora of unlabelled speech data, produce general-purpose representations that are useful across tasks. However, these representations encode information about salient speech variables in a distributed manner, while downstream speech tasks rely on only some of this variability. In this work, we propose a post-training refinement approach using interventional contrastive learning. By leveraging an interventional dataset and a multi-part contrastive loss, we learn a transformation from the entangled representation space of speech foundation models into separate content and speaker subspaces. We evaluate the learnt representations on speaker verification and keyword spotting tasks, showing improved out-of-domain speaker verification performance and evidence that speaker and content information are separated across the learnt subspaces.
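To make the idea of a multi-part contrastive loss over interventional pairs concrete, the following is a minimal sketch and not the method as implemented in this work. It assumes, purely for illustration, that the transformation is a pair of linear projections (`W_content`, `W_speaker`), that each loss part is a standard InfoNCE term, and that the interventional dataset supplies, for every anchor utterance, one sample with the same content but a different speaker and one with the same speaker but different content. All function and variable names are hypothetical.

```python
import numpy as np

def l2_normalize(z, eps=1e-8):
    # Scale each row to (approximately) unit length so dot products are cosine similarities.
    return z / (np.linalg.norm(z, axis=-1, keepdims=True) + eps)

def info_nce(anchors, positives, temperature=0.1):
    # Standard InfoNCE: row i of `positives` is the positive for row i of `anchors`;
    # all other rows in the batch act as negatives.
    a = l2_normalize(anchors)
    p = l2_normalize(positives)
    logits = a @ p.T / temperature                        # (B, B) scaled cosine similarities
    logits = logits - logits.max(axis=1, keepdims=True)   # shift for numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))

def multi_part_loss(x, x_same_content, x_same_speaker, W_content, W_speaker):
    # Assumed two-part loss (hypothetical, for illustration only):
    #   content part: anchor vs. same-content / different-speaker sample, in the content subspace
    #   speaker part: anchor vs. same-speaker / different-content sample, in the speaker subspace
    content_term = info_nce(x @ W_content, x_same_content @ W_content)
    speaker_term = info_nce(x @ W_speaker, x_same_speaker @ W_speaker)
    return content_term + speaker_term

# Synthetic stand-in for foundation-model embeddings: each vector is a
# concatenation of a content factor and a speaker factor.
rng = np.random.default_rng(0)
B, d = 8, 16
content, speaker = rng.normal(size=(B, d)), rng.normal(size=(B, d))
other_content, other_speaker = rng.normal(size=(B, d)), rng.normal(size=(B, d))

x = np.concatenate([content, speaker], axis=1)
x_same_content = np.concatenate([content, other_speaker], axis=1)   # speaker intervened on
x_same_speaker = np.concatenate([other_content, speaker], axis=1)   # content intervened on

# Projections that pick out the correct halves, and the same projections swapped.
W_content = np.concatenate([np.eye(d), np.zeros((d, d))], axis=0)
W_speaker = np.concatenate([np.zeros((d, d)), np.eye(d)], axis=0)

separated_loss = multi_part_loss(x, x_same_content, x_same_speaker, W_content, W_speaker)
swapped_loss = multi_part_loss(x, x_same_content, x_same_speaker, W_speaker, W_content)
print(separated_loss, swapped_loss)
```

In this toy setting, projections that route content and speaker factors to their own subspaces yield a lower loss than the swapped projections, which is the behaviour such an objective is meant to reward. In practice the projections would be learnt by gradient descent on real foundation-model features, and the actual loss may contain further parts not shown here.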