Large Language Models (LLMs) have made significant progress in incorporating Indic languages within multilingual models. However, it is crucial to quantitatively assess whether these languages perform comparably to globally dominant ones, such as English. Currently, there is a lack of benchmark datasets specifically designed to evaluate the regional knowledge of LLMs in various Indic languages. In this paper, we present L3Cube-IndicQuest, a gold-standard factual question-answering benchmark dataset designed to evaluate how well multilingual LLMs capture regional knowledge across various Indic languages. The dataset contains 200 question-answer pairs for each of English and 19 Indic languages, covering five domains specific to the Indic region. We intend this dataset to serve as a benchmark, providing ground truth for evaluating how well LLMs understand and represent knowledge relevant to the Indian context. IndicQuest can be used for both reference-based evaluation and LLM-as-a-judge evaluation. The dataset is shared publicly at https://github.com/l3cube-pune/indic-nlp.