Although mathematics is often considered culturally neutral, the way mathematical problems are presented can carry implicit cultural context. Existing benchmarks such as GSM8K are predominantly rooted in Western norms, which show up in their names, currencies, and everyday scenarios. In this work, we create culturally adapted variants of the GSM8K test set for five regions (Africa, India, China, Korea, and Japan) using prompt-based transformations followed by manual verification. We evaluate six large language models (LLMs), ranging from 8B to 72B parameters, across five prompting strategies to assess their robustness to cultural variation in math problem presentation. Our findings reveal a consistent performance gap: models perform best on the original US-centric dataset and comparatively worse on the culturally adapted versions. However, models with reasoning capabilities are more resilient to these shifts, suggesting that deeper reasoning helps bridge cultural presentation gaps in mathematical tasks.