Multilingual benchmarks rarely test reasoning over culturally grounded premises: translated datasets keep English-centric scenarios, while culture-first datasets often lack control over the reasoning required. We propose Macaron, a template-first benchmark that factorizes reasoning type and cultural aspect across question languages. Using 100 language-agnostic templates covering 7 reasoning types and 22 cultural aspects, native annotators create scenario-aligned English and local-language multiple-choice questions, along with systematically derived True/False questions. Macaron contains 11,862 instances spanning 20 countries/cultural contexts, 10 scripts, and 20 languages (including low-resource ones such as Amharic, Yoruba, Zulu, Kyrgyz, and several Arabic dialects). In a zero-shot evaluation of 21 multilingual LLMs, reasoning-mode models achieve the strongest performance and near-parity between English and local languages, while open-weight models degrade substantially in local languages and often approach chance on True/False tasks. Culture-grounded mathematical and counting templates are consistently the hardest. The data is available at https://huggingface.co/datasets/AlaaAhmed2444/Macaron.