Trustworthy answer content is abundant in many high-resource languages and is instantly accessible through question answering systems, yet this content can be hard to access for those who do not speak these languages. The leap in cross-lingual modeling quality brought by generative language models holds much promise, yet their raw generations often fall short in factuality. To improve trustworthiness in these systems, a promising direction is to attribute the answer to a retrieved source, possibly in a content-rich language different from the query. Our work is the first to study attribution for cross-lingual question answering. First, we collect data in 5 languages to assess the attribution level of a state-of-the-art cross-lingual QA system. To our surprise, we find that a substantial portion of the answers is not attributable to any retrieved passage (up to 50% of the answers that exactly match a gold reference), even though the system can attend directly to the retrieved text. Second, to address this poor attribution level, we experiment with a wide range of attribution detection techniques. We find that Natural Language Inference models and PaLM 2 fine-tuned on a very small amount of attribution data can accurately detect attribution. Based on these models, we improve the attribution level of a cross-lingual question answering system. Overall, we show that current academic generative cross-lingual QA systems have substantial shortcomings in attribution, and we build tooling to mitigate these issues.