A number of methods have been proposed for End-to-End Spoken Language Understanding (E2E-SLU) using pretrained models; however, their evaluation often lacks a multilingual setup and tasks that require the prediction of lexical fillers, such as slot filling. In this work, we propose a unified method that integrates multilingual pretrained speech and text models and performs E2E-SLU in a generative manner on six datasets in four languages, including the prediction of lexical fillers. We investigate how the proposed method can be improved by pretraining on widely available speech recognition data using several training objectives. Pretraining on 7000 hours of multilingual data allows us to outperform the state of the art outright on two SLU datasets and partially on two more. Finally, we examine the cross-lingual capabilities of the proposed model and improve on the best known result on the PortMEDIA-Language dataset by almost half, achieving a Concept/Value Error Rate of 23.65%.