AfriEconQA: A Benchmark Dataset for African Economic Analysis based on World Bank Reports

We introduce AfriEconQA, a specialized benchmark dataset for African economic analysis grounded in a corpus of 236 World Bank reports. The task is to answer complex economic queries that require high-precision numerical reasoning and temporal disambiguation over specialized institutional documents. The dataset consists of 8,937 curated QA instances, filtered from a pool of 10,018 synthetic questions to ensure high-quality evidence-answer alignment. Each instance comprises: (1) a question requiring reasoning over economic indicators, (2) the corresponding evidence retrieved from the corpus, (3) a verified ground-truth answer, and (4) source metadata (e.g., URL and publication date) to ensure temporal provenance. AfriEconQA is the first benchmark focused specifically on African economic analysis, and it poses a distinctive challenge for Information Retrieval (IR) systems because the underlying data is largely absent from the pretraining corpora of current Large Language Models (LLMs). We evaluate the dataset through an 11-experiment matrix that benchmarks a zero-shot baseline (GPT-5 Mini) against retrieval-augmented generation (RAG) configurations using GPT-4o and Qwen 32B across five distinct embedding and ranking strategies. Our results reveal a severe parametric knowledge gap: zero-shot models fail to answer over 90 percent of queries, and even state-of-the-art RAG pipelines struggle to achieve high precision. This confirms AfriEconQA as a robust and challenging benchmark for the next generation of domain-specific IR and RAG systems. The AfriEconQA dataset and code will be made publicly available upon publication.
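
The four-part instance structure and the 11-run experiment matrix described above can be sketched as follows. This is a minimal illustration, not the released AfriEconQA schema or evaluation code: the class name `QAInstance`, its field names, the helper `experiment_matrix`, and the strategy labels `strategy_1` to `strategy_5` are all hypothetical placeholders, since the abstract does not name the five embedding and ranking strategies. Only the counts (236 reports, 10,018 candidates, 8,937 retained) and the model names come from the text.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class QAInstance:
    """One benchmark instance; field names are assumed, not the official schema."""
    question: str          # (1) question over economic indicators
    evidence: str          # (2) supporting passage retrieved from the corpus
    answer: str            # (3) verified ground-truth answer
    source_url: str        # (4) source metadata: report URL
    publication_date: str  # (4) source metadata: date, format assumed


def experiment_matrix(baseline, rag_models, strategies):
    """Return one zero-shot run plus every (RAG model, strategy) pairing."""
    runs = [(baseline, None)]  # None marks the absence of retrieval
    runs.extend(product(rag_models, strategies))
    return runs


# Placeholder labels: the abstract does not list the five strategies.
STRATEGIES = [f"strategy_{i}" for i in range(1, 6)]
RUNS = experiment_matrix("GPT-5 Mini", ["GPT-4o", "Qwen 32B"], STRATEGIES)

# Share of synthetic questions that survived filtering, from the stated counts.
RETENTION = 8937 / 10018

print(len(RUNS))           # 11 = 1 zero-shot + 2 models x 5 strategies
print(f"{RETENTION:.1%}")  # 89.2%
```

The enumeration makes the arithmetic behind the matrix explicit: one zero-shot baseline plus two RAG generators crossed with five strategies gives 11 runs, and the filtering step keeps roughly 89 percent of the synthetic candidate pool.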
