End-to-end spoken language understanding (SLU) remains elusive even with current large pretrained language models for text and speech, especially in multilingual settings. Machine translation is an established and powerful pretraining objective for text because it enables the model to capture the high-level semantics of the input utterance and the associations between languages, both of which are desirable for speech models that operate on lower-level acoustic frames. Motivated particularly by cross-lingual SLU, we demonstrate that speech translation (ST) is an effective pretraining task for end-to-end SLU in both intra- and cross-lingual scenarios. With ST pretraining, our models outperform baselines on monolingual and multilingual intent classification and on spoken question answering, evaluated on the SLURP, MINDS-14, and NMSQA benchmarks. To further verify the effectiveness of our methods, we also create new benchmark datasets from both synthetic and real sources for speech summarization and for low-resource and zero-shot transfer from English to French or Spanish. Finally, we show that preserving knowledge from the ST pretraining task improves downstream performance, possibly by using Bayesian transfer regularizers.