Large Language Models (LLMs) are increasingly used to answer everyday questions, yet their performance on culturally grounded and dialectal content remains uneven across languages. We propose a comprehensive method that (i) translates Modern Standard Arabic (MSA) multiple-choice questions (MCQs) into English and several Arabic dialects, (ii) converts them into open-ended questions (OEQs), (iii) benchmarks a range of zero-shot and fine-tuned LLMs under both MCQ and OEQ settings, and (iv) generates chain-of-thought (CoT) rationales to fine-tune models for step-by-step reasoning. Using this method, we extend an existing dataset in which QA pairs are aligned in parallel across multiple language varieties, making it, to our knowledge, the first of its kind. We conduct extensive experiments with both open and closed models. Our findings show that (i) models underperform on Arabic dialects, revealing persistent gaps in culturally grounded and dialect-specific knowledge; (ii) Arabic-centric models perform well on MCQs but struggle with OEQs; and (iii) CoT improves judged correctness while yielding mixed results on n-gram-based metrics. The developed dataset will be publicly released to support further research on culturally and linguistically inclusive evaluation.