End-to-end spoken language understanding (SLU) remains elusive even with current large pretrained language models for text and speech, especially in multilingual settings. Machine translation is an established and powerful pretraining objective for text: it enables a model to capture the high-level semantics of an input utterance and the associations between languages, both of which are desirable for speech models that operate on lower-level acoustic frames. Motivated in particular by cross-lingual SLU, we demonstrate that speech translation (ST) is an effective pretraining task for end-to-end SLU speech models in both monolingual and cross-lingual scenarios. With ST pretraining, our models outperform current baselines on monolingual and multilingual intent classification and on spoken question answering, evaluated on the SLURP, MINDS-14, and NMSQA benchmarks. To further verify the effectiveness of our methods, we also release two new benchmark datasets, built from both synthetic and real sources, for abstractive summarization from speech and for low-resource or zero-shot transfer from English to French. Finally, we show the value of preserving knowledge from the pretraining task, and to that end we explore Bayesian transfer learning for pretrained speech models based on continual learning regularizers.
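The abstract does not say which continual learning regularizers are used, so the following is only a minimal sketch of one common family: a quadratic penalty that anchors fine-tuned parameters to their pretrained values. With uniform importance weights it reduces to an L2-SP-style penalty; with per-parameter importance weights (for example, diagonal Fisher information estimates) it takes the form used in elastic weight consolidation. The function names, the `strength` coefficient, and the flat-list parameter representation are all hypothetical choices made for illustration, not the authors' implementation.

```python
from typing import Optional, Sequence


def quadratic_anchor_penalty(
    params: Sequence[float],
    pretrained: Sequence[float],
    importance: Optional[Sequence[float]] = None,
    strength: float = 1.0,
) -> float:
    """Return 0.5 * strength * sum_i F_i * (theta_i - theta_pre_i) ** 2.

    `importance` (F_i) is assumed to be a per-parameter weight such as a
    diagonal Fisher estimate; if omitted, every parameter is weighted 1.0.
    """
    if importance is None:
        importance = [1.0] * len(params)
    if not (len(params) == len(pretrained) == len(importance)):
        raise ValueError("params, pretrained and importance must have equal length")
    weighted_sq_dist = sum(
        f * (p - q) ** 2 for p, q, f in zip(params, pretrained, importance)
    )
    return 0.5 * strength * weighted_sq_dist


def regularized_loss(
    task_loss: float,
    params: Sequence[float],
    pretrained: Sequence[float],
    importance: Optional[Sequence[float]] = None,
    strength: float = 1.0,
) -> float:
    """Add the anchoring penalty to the downstream (e.g. SLU) task loss."""
    return task_loss + quadratic_anchor_penalty(
        params, pretrained, importance, strength
    )
```

In this sketch, a parameter with zero importance can move freely during fine-tuning, while a parameter with high importance is pulled back toward the value it had after ST pretraining, which is one way to preserve pretraining knowledge.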