We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a corpus of 236 World Bank reports. The task is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation over specialized institutional documents. The dataset consists of 8,937 curated QA instances, filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance comprises: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance. AfriEconQA is the first benchmark focused specifically on African economic analysis, and it poses a distinctive challenge for Information Retrieval (IR) systems because the data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We operationalize the dataset through an 11-experiment matrix, benchmarking a zero-shot baseline (GPT-5 Mini) against RAG configurations that use GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results demonstrate a severe parametric knowledge gap: zero-shot models fail to answer over 90% of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.
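The instance structure and experiment matrix described in the abstract can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the released code: the field names, the `strategy_1` to `strategy_5` placeholders, and the reading of the 11 experiments as one zero-shot run plus two generators times five strategies are all assumptions, since the abstract does not name the strategies or give a schema.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class QAInstance:
    """One benchmark item; field names are hypothetical, not the released schema."""
    question: str          # (1) question over economic indicators
    evidence: str          # (2) supporting passage from the report corpus
    answer: str            # (3) verified ground-truth answer
    source_url: str        # (4) source metadata: report URL
    publication_date: str  # (4) source metadata: publication date


ZERO_SHOT_BASELINE = "GPT-5 Mini"
RAG_GENERATORS = ["GPT-4o", "Qwen 32B"]
# Placeholder labels: the abstract does not name the five strategies.
RETRIEVAL_STRATEGIES = [f"strategy_{i}" for i in range(1, 6)]


def experiment_matrix():
    """Assumed layout: 1 zero-shot run + (2 generators x 5 strategies)."""
    runs = [(ZERO_SHOT_BASELINE, None)]  # None means no retrieval
    runs += list(product(RAG_GENERATORS, RETRIEVAL_STRATEGIES))
    return runs


def retention_rate(kept=8937, pool=10018):
    """Share of the synthetic question pool that survived filtering."""
    return kept / pool
```

Under this reading, `len(experiment_matrix())` matches the 11 experiments reported, and `retention_rate()` shows that roughly 89% of the 10,018 synthetic questions were kept after filtering.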