Grounding Arabic LLMs in the Doha Historical Dictionary: Retrieval-Augmented Understanding of Quran and Hadith

Large language models (LLMs) have achieved remarkable progress in many language tasks, yet they continue to struggle with complex historical and religious Arabic texts such as the Quran and Hadith. To address this limitation, we develop a retrieval-augmented generation (RAG) framework grounded in diachronic lexicographic knowledge. Unlike prior RAG systems that rely on general-purpose corpora, our approach retrieves evidence from the Doha Historical Dictionary of Arabic (DHDA), a large-scale resource documenting the historical development of Arabic vocabulary. The proposed pipeline combines hybrid retrieval with an intent-based routing mechanism to provide LLMs with precise, contextually relevant historical information. Our experiments show that this approach improves the accuracy of Arabic-native LLMs, including Fanar and ALLaM, to over 85%, substantially reducing the performance gap with Gemini, a proprietary large-scale model. Gemini also serves as an LLM-as-a-judge system for automatic evaluation in our experiments. The automated judgments were verified through human evaluation, demonstrating high agreement (kappa = 0.87). An error analysis further highlights key linguistic challenges, including diacritics and compound expressions. These findings demonstrate the value of integrating diachronic lexicographic resources into retrieval-augmented generation frameworks to enhance Arabic language understanding, particularly for historical and religious texts. The code and resources are publicly available at: https://github.com/somayaeltanbouly/Doha-Dictionary-RAG.
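
To make the idea of hybrid retrieval with intent-based routing concrete, the following is a minimal sketch and not the authors' implementation. The abstract does not specify the retrievers, the fusion rule, or the intent categories, so everything here is an assumption for illustration: the toy entries in `ENTRIES`, the token-overlap and character-trigram rankers (the latter standing in for a dense retriever), the use of reciprocal rank fusion to merge rankings, and the keyword rules in `route_intent`.

```python
import math
from collections import Counter

# Hypothetical toy entries; these are NOT taken from the DHDA.
ENTRIES = [
    {"id": "e1", "text": "kitab: a written document or book; attested early in inscriptions"},
    {"id": "e2", "text": "qalam: a reed pen used for writing"},
    {"id": "e3", "text": "salat: ritual prayer; earlier sense of supplication"},
]

def tokenize(s):
    # Lowercase and split on whitespace after removing simple punctuation.
    return s.lower().replace(":", " ").replace(";", " ").split()

def lexical_rank(query, entries):
    # Rank entries by the number of tokens shared with the query.
    q = set(tokenize(query))
    scored = [(len(q & set(tokenize(e["text"]))), e["id"]) for e in entries]
    return [eid for _, eid in sorted(scored, key=lambda x: (-x[0], x[1]))]

def char_ngrams(s, n=3):
    s = s.lower()
    return Counter(s[i:i + n] for i in range(len(s) - n + 1))

def cosine(a, b):
    dot = sum(a[k] * b[k] for k in a if k in b)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def fuzzy_rank(query, entries):
    # Character-trigram cosine similarity, a stand-in for a dense retriever.
    q = char_ngrams(query)
    scored = [(cosine(q, char_ngrams(e["text"])), e["id"]) for e in entries]
    return [eid for _, eid in sorted(scored, key=lambda x: (-x[0], x[1]))]

def rrf(rankings, k=60):
    # Reciprocal rank fusion: sum 1 / (k + rank) over all rankings.
    scores = {}
    for ranking in rankings:
        for pos, eid in enumerate(ranking, start=1):
            scores[eid] = scores.get(eid, 0.0) + 1.0 / (k + pos)
    return sorted(scores, key=lambda e: (-scores[e], e))

def route_intent(query):
    # Hypothetical keyword rules; a real router could be a trained classifier.
    q = query.lower()
    if "meaning" in q or "define" in q:
        return "definition"
    if "first" in q or "earliest" in q or "history" in q:
        return "history"
    return "general"

def retrieve(query, top_k=2):
    # Route the query, then fuse the two rankings and keep the top entries.
    intent = route_intent(query)
    ranked = rrf([lexical_rank(query, ENTRIES), fuzzy_rank(query, ENTRIES)])
    return intent, ranked[:top_k]

if __name__ == "__main__":
    print(retrieve("meaning of salat prayer"))
```

In a fuller system, the routed intent would presumably select which dictionary fields or prompt template to pass to the LLM; here it is only returned alongside the fused ranking.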
