African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.