Existing cultural commonsense benchmarks treat nations as monolithic, assuming that practices are uniform within national boundaries. But does cultural commonsense hold uniformly within a nation, or does it vary at the sub-national level? We introduce Indica, the first benchmark designed to test LLMs' ability to address this question, focusing on India, a nation of 28 states, 8 union territories, and 22 official languages. We collect human-annotated answers from five Indian regions (North, South, East, West, and Central) for 515 questions spanning 8 domains of everyday life, yielding 1,630 region-specific question-answer pairs. Strikingly, only 39.4% of questions elicit agreement across all five regions, demonstrating that cultural commonsense in India is predominantly regional, not national. We evaluate eight state-of-the-art LLMs and find two critical gaps: models achieve only 13.4%-20.9% accuracy on region-specific questions, and they exhibit geographic bias, over-selecting Central and North India as the "default" (selected 30-40% more often than expected) while under-representing East and West. Beyond India, our methodology provides a generalizable framework for evaluating cultural commonsense in any culturally heterogeneous nation, covering question design grounded in anthropological taxonomy, regional data collection, and bias measurement.
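To make the two headline measurements concrete, the following is a minimal sketch of how a cross-region agreement rate and a "selected more often than expected" figure could be computed. It is not the authors' implementation: the function names, the data layout, and in particular the choice of a uniform baseline (each of the five regions expected to be selected equally often) are assumptions made for illustration, since the abstract does not define the expected rate.

```python
from collections import Counter

# Region labels as named in the abstract.
REGIONS = ["North", "South", "East", "West", "Central"]


def full_agreement_rate(answers_by_question):
    """Fraction of questions whose answer is identical across all regions.

    `answers_by_question` maps a question id to a dict {region: answer}.
    This layout is hypothetical, chosen only for the sketch.
    """
    if not answers_by_question:
        raise ValueError("no questions given")
    agreeing = sum(
        1
        for answers in answers_by_question.values()
        if len(set(answers.values())) == 1
    )
    return agreeing / len(answers_by_question)


def over_selection(predicted_regions, regions=REGIONS):
    """Relative over- or under-selection of each region by a model.

    Assumes a uniform expected share of 1 / len(regions); the benchmark
    may define the expected rate differently. A value of 0.3 means the
    region was chosen 30% more often than the uniform baseline.
    """
    if not predicted_regions:
        raise ValueError("no predictions given")
    counts = Counter(predicted_regions)
    expected_share = 1 / len(regions)
    total = len(predicted_regions)
    return {r: (counts[r] / total) / expected_share - 1.0 for r in regions}


if __name__ == "__main__":
    # Toy data, invented for the example.
    toy_answers = {
        "q1": {r: "rice" for r in REGIONS},
        "q2": {"North": "roti", "South": "rice", "East": "rice",
               "West": "roti", "Central": "roti"},
    }
    toy_predictions = ["Central"] * 3 + ["North"] * 3 + ["South"] * 2 + ["East", "West"]
    print(full_agreement_rate(toy_answers))
    print(over_selection(toy_predictions))
```

Under this uniform-baseline assumption, positive values flag regions a model treats as a default and negative values flag under-represented regions; a different baseline, such as the share of questions whose gold answer belongs to each region, would change the numbers but not the structure of the calculation.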