Reasoning-focused Question Answering (QA) has advanced rapidly with Large Language Models (LLMs), yet high-quality benchmarks for low-resource languages remain scarce. Persian, spoken by roughly 130 million people, lacks a comprehensive open-domain resource for evaluating reasoning-capable QA systems. We introduce PARSE, the first open-domain Persian reasoning QA benchmark, containing 10,800 questions across Boolean, multiple-choice, and factoid formats, with diverse reasoning types, difficulty levels, and answer structures. The benchmark is built via a controlled LLM-based generation pipeline and validated through human evaluation, with linguistic and factual quality ensured by multi-stage filtering, annotation, and consistency checks. We benchmark multilingual and Persian-specialized LLMs under multiple prompting strategies and show that Persian-language prompts and structured prompting (chain-of-thought for Boolean and multiple-choice questions; few-shot for factoid questions) improve performance. Fine-tuning further boosts results, especially for Persian-specialized models. These findings highlight how PARSE supports both fair model comparison and practical model adaptation. PARSE fills a critical gap in Persian QA research and provides a strong foundation for developing and evaluating reasoning-capable LLMs in low-resource settings.