Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QA pairs are aligned in parallel across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed results on n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.