QASiNa: Religious Domain Question Answering using Sirah Nabawiyah

Question Answering (QA) tasks currently receive significant research attention, particularly with the development of Large Language Models (LLMs) such as ChatGPT [1]. LLMs can be applied to various domains, but applying them to the Islamic domain conflicts with the principles of information transmission. Islam strictly regulates the sources of information and who may give interpretations, or tafseer, of those sources [2]. The way an LLM generates answers based on its own interpretation resembles tafseer, yet an LLM is neither an Islamic expert nor a human, so this is not permitted in Islam. Indonesia has the largest Muslim population in the world [3]. Given the strong influence of LLMs, their performance in the religious domain needs to be evaluated. Currently, only a few religious QA datasets are available, and none of them uses Sirah Nabawiyah, especially in the Indonesian language. In this paper, we propose the Question Answering Sirah Nabawiyah (QASiNa) dataset, a novel dataset compiled from Sirah Nabawiyah literature in the Indonesian language. We demonstrate our dataset using mBERT [4], XLM-R [5], and IndoBERT [6], each fine-tuned on the Indonesian translation of SQuAD v2.0 [7]. The XLM-R model achieved the best performance on QASiNa, with an Exact Match (EM) of 61.20, an F1-Score of 75.94, and a Substring Match of 70.00. We compare XLM-R's performance with ChatGPT-3.5 and GPT-4 [1]. Both ChatGPT versions returned lower EM and F1-Score but higher Substring Match, and the gap between EM and Substring Match widens with GPT-4. The experiments indicate that ChatGPT tends to give excessive interpretation, as evidenced by its Substring Match scores being higher than its EM and F1-Score, even after being given instructions and context. We conclude that ChatGPT is unsuitable for question answering tasks in the religious domain, especially for Islam.
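The abstract reports three metrics without defining them. The sketch below shows one plausible way to compute them for a single prediction, under assumptions that are not taken from the paper: EM and F1 follow the common SQuAD-style convention (lowercasing, punctuation stripping, token overlap), and Substring Match is assumed to mean that the normalized gold answer appears somewhere inside the normalized prediction. The function names and the example strings are hypothetical. The example illustrates why a verbose answer can score high on Substring Match while scoring low on EM and F1.

```python
import string
from collections import Counter


def normalize(text: str) -> str:
    """Lowercase, drop ASCII punctuation, and collapse whitespace (assumed normalization)."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(gold))


def f1_score(prediction: str, gold: str) -> float:
    """Token-level F1 between normalized prediction and gold answer."""
    pred_tokens = normalize(prediction).split()
    gold_tokens = normalize(gold).split()
    if not pred_tokens or not gold_tokens:
        # Unanswerable case: credit only if both sides are empty.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def substring_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized gold answer occurs inside the normalized prediction (assumed definition)."""
    pred_norm, gold_norm = normalize(prediction), normalize(gold)
    if not gold_norm:
        return float(not pred_norm)
    return float(gold_norm in pred_norm)


# Hypothetical example: a short extractive answer versus a verbose generated answer.
gold = "Mekah"
short_answer = "Mekah"
verbose_answer = "Nabi Muhammad lahir di kota Mekah."

for name, pred in [("short", short_answer), ("verbose", verbose_answer)]:
    print(name, exact_match(pred, gold), round(f1_score(pred, gold), 4), substring_match(pred, gold))
```

Under these assumed definitions, the verbose answer contains the gold span, so it counts as a substring match, but the extra tokens lower its precision and therefore its F1, and it fails EM entirely. This is the pattern the abstract describes for the ChatGPT results.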
