Multiple-choice question answering tasks evaluate the reasoning, comprehension, and mathematical abilities of Large Language Models (LLMs). While existing benchmarks employ automatic translation for multilingual evaluation, this approach is error-prone and potentially introduces culturally biased questions, especially in the social sciences. We introduce TurkishMMLU, the first multitask, multiple-choice Turkish QA benchmark, to evaluate LLMs' understanding of the Turkish language. TurkishMMLU includes over 10,000 questions covering 9 subjects from the Turkish high-school education curricula. The questions are written by curriculum experts and aligned with the high-school curricula in Turkey, spanning subjects from the natural sciences and math to more culturally representative topics such as Turkish Literature and the history of the Turkish Republic. We evaluate over 20 LLMs, including multilingual open-source (e.g., Gemma, Llama, mT5), closed-source (GPT-4o, Claude, Gemini), and Turkish-adapted (e.g., Trendyol) models. We provide an extensive evaluation, including zero-shot and few-shot evaluation of LLMs, chain-of-thought reasoning, and question-difficulty analysis alongside model performance. We offer an in-depth analysis of the Turkish capabilities and limitations of current LLMs to provide insights for future LLMs for the Turkish language. We publicly release our code for the dataset and evaluation: https://github.com/ArdaYueksel/TurkishMMLU.