Cross-lingual question answering (CLQA) is a complex problem comprising cross-lingual retrieval from a multilingual knowledge base followed by answer generation in either English or the query language. The two steps are usually tackled by separate models that require substantial annotated datasets and, typically, auxiliary resources such as machine translation systems to bridge between languages. In this paper, we show that CLQA can be addressed using a single encoder-decoder model. To train this model effectively, we propose a self-supervised method that exploits the cross-lingual link structure within Wikipedia. We demonstrate how linked Wikipedia pages can be used to synthesise supervisory signals for cross-lingual retrieval, through a form of cloze query, and to generate more natural queries to supervise answer generation. Together, we show that our approach, \texttt{CLASS}, outperforms comparable methods in both supervised and zero-shot language adaptation settings, including methods that use machine translation.
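To make the idea of a cloze query built from linked Wikipedia pages more concrete, the following is a minimal illustrative sketch, not the construction used by \texttt{CLASS}, whose details are not given here. It assumes that each sentence comes with its hyperlink anchors and that an inter-language link table is available; the names `ClozeExample`, `make_cloze_examples`, `anchors`, `langlinks`, and the `<mask>` token are all hypothetical. An anchor's surface text is masked to form the query, and the page linked to it in another language serves as the cross-lingual retrieval target.

```python
from dataclasses import dataclass

MASK = "<mask>"  # hypothetical mask token, chosen only for this illustration


@dataclass
class ClozeExample:
    query: str         # source sentence with one anchor text masked out
    query_lang: str    # language of the query sentence
    target_title: str  # title of the linked page in the target language
    target_lang: str   # language of the retrieval target
    answer: str        # the masked anchor text


def make_cloze_examples(sentence, lang, anchors, langlinks, target_lang):
    """Build cloze-style retrieval examples from one sentence (assumed format).

    anchors:   list of (surface_text, page_title) pairs for links in the sentence
    langlinks: dict mapping (lang, page_title) -> {other_lang: other_title}
    """
    examples = []
    for surface, title in anchors:
        # Find the same entity's page in the target language, if one exists.
        linked_title = langlinks.get((lang, title), {}).get(target_lang)
        if linked_title is None or surface not in sentence:
            continue
        # Mask the first occurrence of the anchor text to form the query.
        query = sentence.replace(surface, MASK, 1)
        examples.append(
            ClozeExample(query, lang, linked_title, target_lang, surface)
        )
    return examples


if __name__ == "__main__":
    toy_langlinks = {("en", "Warsaw"): {"de": "Warschau"}}
    toy_anchors = [("Warsaw", "Warsaw"), ("Marie Curie", "Marie Curie")]
    for ex in make_cloze_examples(
        "Marie Curie was born in Warsaw.", "en", toy_anchors, toy_langlinks, "de"
    ):
        print(ex)
```

In this toy setup only anchors with a page in the target language yield examples, so the supervision comes entirely from Wikipedia's existing link structure and needs no manual annotation or machine translation.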