Understanding culture requires reasoning across context, tradition, and implicit social knowledge, far beyond recalling isolated facts. Yet most culturally focused question answering (QA) benchmarks rely on single-hop questions, which may allow models to exploit shallow cues rather than demonstrate genuine cultural reasoning. In this work, we introduce ID-MoCQA, the first large-scale multi-hop QA dataset for assessing the cultural understanding of large language models (LLMs), grounded in Indonesian traditions and available in both English and Indonesian. We present a new framework that systematically transforms single-hop cultural questions into multi-hop reasoning chains spanning six clue types (e.g., commonsense, temporal, geographical). Our multi-stage validation pipeline, combining expert review and LLM-as-a-judge filtering, ensures high-quality question-answer pairs. Our evaluation across state-of-the-art models reveals substantial gaps in cultural reasoning, particularly in tasks requiring nuanced inference. ID-MoCQA provides a challenging and essential benchmark for advancing the cultural competency of LLMs.