Trustworthy answer content is abundant in many high-resource languages and is instantly accessible through question answering systems, yet this content can be hard to access for those who do not speak these languages. The leap in cross-lingual modeling quality brought by generative language models holds much promise, yet their raw generations often fall short in factuality. To improve trustworthiness in these systems, a promising direction is to attribute the answer to a retrieved source, possibly in a content-rich language different from the query. Our work is the first to study attribution for cross-lingual question answering. First, we collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system. To our surprise, we find that a substantial portion of the answers are not attributable to any retrieved passage (up to 50% of answers exactly matching a gold reference), even though the system can attend directly to the retrieved text. Second, to address this poor attribution level, we experiment with a wide range of attribution detection techniques. We find that Natural Language Inference models and PaLM 2 fine-tuned on a very small amount of attribution data can accurately detect attribution. Based on these models, we improve the attribution level of a cross-lingual question answering system. Overall, we show that current academic generative cross-lingual QA systems have substantial shortcomings in attribution, and we build tooling to mitigate these issues.