Self-supervised learning (SSL) for speech representation has been successfully applied to various downstream tasks, such as speech recognition and speaker recognition. More recently, speech SSL models have also been shown to benefit spoken language understanding tasks, implying that they have the potential to learn not only acoustic but also linguistic information. In this paper, we aim to clarify whether speech SSL techniques can capture linguistic knowledge well. For this purpose, we introduce SpeechGLUE, a speech version of the General Language Understanding Evaluation (GLUE) benchmark. Since GLUE comprises a variety of natural language understanding tasks, SpeechGLUE can elucidate the degree of linguistic ability of speech SSL models. Experiments demonstrate that speech SSL models, although inferior to text-based SSL models, perform better than baselines, suggesting that they can acquire a certain amount of general linguistic knowledge from just unlabeled speech data.