In recent years, speech representation learning has been formulated primarily as a self-supervised learning (SSL) task that uses the raw audio signal alone and ignores the side information often available for a given speech recording. In this paper, we propose MASR, a Metadata Aware Speech Representation learning framework, which addresses this limitation. MASR allows multiple external knowledge sources to be included, improving the use of metadata. The external knowledge sources are incorporated in the form of sample-level pairwise similarity matrices that are used in a hard-mining loss. A key advantage of the MASR framework is that it can be combined with any choice of SSL method. Using MASR representations, we perform evaluations on several downstream tasks, including language identification and speech recognition, as well as non-semantic tasks such as speaker and emotion recognition. In these experiments, we show significant performance improvements for MASR over other established benchmarks. We also perform a detailed analysis on the language identification task to provide insight into how the proposed loss function enables the representations to separate closely related languages.
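To make the idea of a similarity-driven hard-mining loss concrete, here is a minimal sketch. The abstract does not give the exact form of the MASR loss, so this uses a generic batch-hard triplet loss as a stand-in. The function names, the `pos_threshold` and `margin` parameters, and the convention that similarity scores lie in [0, 1] are all assumptions made for illustration.

```python
import math


def pairwise_distances(embeddings):
    """Euclidean distance between every pair of embedding vectors."""
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)]
            for i in range(n)]


def hard_mining_triplet_loss(embeddings, similarity,
                             pos_threshold=0.5, margin=1.0):
    """Batch-hard triplet loss guided by a sample-level similarity matrix.

    similarity[i][j] is a hypothetical metadata-derived score in [0, 1],
    e.g. 1.0 when samples i and j share a language and 0.0 otherwise.
    This is an illustrative stand-in, not the loss defined in the paper.
    """
    dist = pairwise_distances(embeddings)
    n = len(embeddings)
    losses = []
    for i in range(n):
        # Split the other samples into positives and negatives for anchor i.
        pos = [dist[i][j] for j in range(n)
               if j != i and similarity[i][j] >= pos_threshold]
        neg = [dist[i][j] for j in range(n)
               if j != i and similarity[i][j] < pos_threshold]
        if not pos or not neg:
            continue  # no valid triplet for this anchor in the batch
        hardest_pos = max(pos)  # farthest sample that metadata marks similar
        hardest_neg = min(neg)  # closest sample that metadata marks dissimilar
        losses.append(max(0.0, hardest_pos - hardest_neg + margin))
    return sum(losses) / len(losses) if losses else 0.0


# Samples 0 and 1 share metadata, as do samples 2 and 3.
similarity = [
    [1.0, 1.0, 0.0, 0.0],
    [1.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 1.0],
    [0.0, 0.0, 1.0, 1.0],
]
well_separated = [[0.0, 0.0], [0.0, 1.0], [5.0, 0.0], [5.0, 1.0]]
badly_separated = [[0.0, 0.0], [5.0, 0.0], [0.0, 1.0], [5.0, 1.0]]
loss_good = hard_mining_triplet_loss(well_separated, similarity)
loss_bad = hard_mining_triplet_loss(badly_separated, similarity)
```

In this sketch the loss vanishes when samples that the metadata marks as similar already sit closer together than dissimilar ones by at least the margin, and it grows when a dissimilar sample lies nearer to the anchor than a similar one. Several knowledge sources could be handled by computing one such term per similarity matrix and summing them, though that combination rule is also an assumption here.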