Existing cultural commonsense benchmarks treat nations as monolithic, assuming uniform practices within national boundaries. But does cultural commonsense hold uniformly within a nation, or does it vary at the sub-national level? We introduce Indica, the first benchmark designed to test how well LLMs handle this sub-national variation, focusing on India, a nation of 28 states, 8 union territories, and 22 official languages. We collect human-annotated answers from five Indian regions (North, South, East, West, and Central) across 515 questions spanning 8 domains of everyday life, yielding 1,630 region-specific question-answer pairs. Strikingly, only 39.4% of questions elicit agreement across all five regions, demonstrating that cultural commonsense in India is predominantly regional, not national. We evaluate eight state-of-the-art LLMs and find two critical gaps. First, models achieve only 13.4%-20.9% accuracy on region-specific questions. Second, they exhibit geographic bias, over-selecting Central and North India as the "default" (selected 30-40% more often than expected) while under-representing East and West. Beyond India, our methodology provides a generalizable framework for evaluating cultural commonsense in any culturally heterogeneous nation, covering question design grounded in anthropological taxonomy, regional data collection, and bias measurement.
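The phrase "selected 30-40% more often than expected" can be read as a relative over-selection rate per region. The sketch below is one possible reading, not the paper's actual metric: it assumes the expected share is uniform across the five regions (20% each), and the function name `over_selection` and its input format are hypothetical.

```python
from collections import Counter

# The five regions named in the abstract.
REGIONS = ["North", "South", "East", "West", "Central"]


def over_selection(predictions, expected_share=None):
    """Relative over-selection per region: (observed - expected) / expected.

    `predictions` is a list of region names chosen by a model.
    ASSUMPTION: when `expected_share` is omitted, every region is expected
    to be chosen equally often; the paper may define "expected" differently.
    """
    if not predictions:
        raise ValueError("predictions must be non-empty")
    if expected_share is None:
        expected_share = {region: 1 / len(REGIONS) for region in REGIONS}

    counts = Counter(predictions)
    total = len(predictions)
    return {
        region: (counts.get(region, 0) / total - expected_share[region])
        / expected_share[region]
        for region in REGIONS
    }


# Toy input for illustration only; these counts are NOT results from the paper.
toy_predictions = (
    ["Central"] * 27 + ["North"] * 26 + ["South"] * 20
    + ["East"] * 14 + ["West"] * 13
)
rates = over_selection(toy_predictions)
```

Under this reading, a positive value means the region is chosen more often than expected and a negative value means it is under-represented. Passing a non-uniform `expected_share` (for example, one derived from the gold label distribution) would change the baseline without changing the calculation.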