Recent work on speech representation models jointly pre-trained with text has shown that encoding speech and text in a shared space can improve speech representations. In this paper, we leverage such shared representations to address the persistent challenge of limited data availability in spoken language understanding tasks. Using a pre-trained speech-text model, we find that models fine-tuned on text transfer effectively to speech test data. With as little as 1 hour of labeled speech data, our approach achieves performance on spoken language understanding tasks (specifically, sentiment analysis and named entity recognition) comparable to that of previous methods that fine-tune speech-only pre-trained models on 10 times more data. Beyond this proof-of-concept study, we also analyze the latent representations. We find that the bottom layers of speech-text models are largely task-agnostic and align speech and text representations in a shared space, while the top layers are more task-specific.