MAWARITH: A Dataset and Benchmark for Legal Inheritance Reasoning with LLMs

Islamic inheritance law is challenging for large language models because solving inheritance cases requires complex, structured, multi-step reasoning and the correct application of juristic rules to compute heirs' shares. We introduce \textit{MAWARITH}, a large-scale annotated dataset of 12,500 Arabic inheritance cases for training and evaluating models on the full reasoning chain: (i) identifying eligible heirs, (ii) applying blocking (\textit{\d{h}ajb}) and allocation rules, and (iii) computing exact inheritance shares. To the best of our knowledge, \textit{MAWARITH} is the first Arabic corpus and benchmark designed for end-to-end Islamic inheritance reasoning. Unlike prior datasets that restrict inheritance case solving to multiple-choice questions, \textit{MAWARITH} supports the full reasoning chain and provides step-by-step solutions with justifications grounded in classical juristic sources and established inheritance rules, as well as exact share calculations. This enables models to learn how to generate detailed, step-by-step responses to user queries that reflect real-world Islamic inheritance cases. To evaluate models beyond final-answer accuracy, we propose \textit{MIR-E} (Mawarith Inheritance Reasoning Evaluation), a weighted multi-stage metric that scores key reasoning stages and captures error propagation across the pipeline. We evaluate six large language models in a zero-shot setting. A commercial model achieves about 90\%, whereas all evaluated open-source models remain below 50\%. Our error analysis identifies recurring failure patterns, including scenario misinterpretation, errors in heir identification, errors in share allocation, and missing or incorrect application of key inheritance rules such as \textit{\textquotesingle awl} and \textit{radd}. The \textit{MAWARITH} dataset is publicly available at https://gitlab.com/nlpresearcher/mawarith.
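
As an illustration of the exact share computation in stage (iii), the following minimal sketch shows how \textit{\textquotesingle awl} (proportional reduction when the fixed shares sum to more than the estate) and \textit{radd} (proportional return of the surplus when they sum to less) can be expressed with exact rational arithmetic. It is not taken from the dataset or its tooling: the function name `normalize_shares` and the input format are hypothetical, the fixed shares are assumed to have been assigned already by stages (i) and (ii), and the sketch assumes that no residuary heir is present and that every listed heir takes part in \textit{radd} (so it does not model the usual exclusion of spouses from \textit{radd}).

```python
from fractions import Fraction


def normalize_shares(fixed_shares):
    """Rescale fixed shares so that they sum to exactly 1.

    Hypothetical helper for illustration only. Assumes the shares were
    already assigned, that there is no residuary heir, and that every
    listed heir participates in radd (no spouse among them when the
    total is below 1).
    """
    total = sum(fixed_shares.values(), Fraction(0))
    if total <= 0:
        raise ValueError("fixed shares must sum to a positive amount")

    if total > 1:
        rule = "awl"    # shares exceed the estate: reduce proportionally
    elif total < 1:
        rule = "radd"   # surplus remains: return it proportionally
    else:
        rule = "none"   # shares already exhaust the estate exactly

    adjusted = {heir: share / total for heir, share in fixed_shares.items()}
    return rule, adjusted


# Husband (1/2) and two full sisters (2/3 jointly): total is 7/6, so awl applies.
awl_rule, awl_shares = normalize_shares(
    {"husband": Fraction(1, 2), "two full sisters": Fraction(2, 3)}
)

# Mother (1/6) and one daughter (1/2): total is 2/3, so radd applies.
radd_rule, radd_shares = normalize_shares(
    {"mother": Fraction(1, 6), "daughter": Fraction(1, 2)}
)

print(awl_rule, awl_shares)
print(radd_rule, radd_shares)
```

In the first case the husband receives 3/7 and the two sisters jointly receive 4/7; in the second the mother receives 1/4 and the daughter 3/4. Using `Fraction` instead of floating point keeps the shares exact, which matters when a model's output is compared against exact reference shares.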
