AfriQA: Cross-lingual Open-Retrieval Question Answering for African Languages

Odunayo Ogundepo, Tajuddeen R. Gwadabe, Clara E. Rivera, Jonathan H. Clark, Sebastian Ruder, David Ifeoluwa Adelani, Bonaventure F. P. Dossou, Abdou Aziz DIOP, Claytone Sikasote, Gilles Hacheme, Happy Buzaaba, Ignatius Ezeani, Rooweither Mabuya, Salomey Osei, Chris Emezue, Albert Njoroge Kahira, Shamsuddeen H. Muhammad, Akintunde Oladipo, Abraham Toluwase Owodunni, Atnafu Lambebo Tonja, Iyanuoluwa Shode, Akari Asai, Tunde Oluwaseyi Ajayi, Clemencia Siro, Steven Arthur, Mofetoluwa Adeyemi, Orevaoghene Ahia, Aremu Anuoluwapo, Oyinkansola Awosan, Chiamaka Chukwuneke, Bernard Opoku, Awokoya Ayodele, Verrah Otiende, Christine Mwase, Boyd Sinkala, Andre Niyongabo Rubungo, Daniel A. Ajisafe, Emeka Felix Onwuegbuzia, Habib Mbow, Emile Niyomutabazi, Eunice Mukonde, Falalu Ibrahim Lawan, Ibrahim Said Ahmad, Jesujoba O. Alabi, Martin Namukombo, Mbonu Chinedu, Mofya Phiri, Neo Putini, Ndumiso Mngoma, Priscilla A. Amuok, Ruqayya Nasir Iro, Sonia Adhiambo

African languages have far less in-language content available digitally, making it challenging for question answering systems to satisfy the information needs of users. Cross-lingual open-retrieval question answering (XOR QA) systems -- those that retrieve answer content from other languages while serving people in their native language -- offer a means of filling this gap. To this end, we create AfriQA, the first cross-lingual QA dataset with a focus on African languages. AfriQA includes 12,000+ XOR QA examples across 10 African languages. While previous datasets have focused primarily on languages where cross-lingual QA augments coverage from the target language, AfriQA focuses on languages where cross-lingual answer content is the only high-coverage source of answer content. Because of this, we argue that African languages are one of the most important and realistic use cases for XOR QA. Our experiments demonstrate the poor performance of automatic translation and multilingual retrieval methods. Overall, AfriQA proves challenging for state-of-the-art QA models. We hope that the dataset enables the development of more equitable QA technology.
