To make large language models (LLMs) more helpful across diverse cultures, it is essential to have effective cultural knowledge benchmarks to measure and track our progress. Effective benchmarks need to be robust, diverse, and challenging. We introduce CulturalBench: a set of 1,227 human-written and human-verified questions for effectively assessing LLMs' cultural knowledge, covering 45 global regions including underrepresented ones like Bangladesh, Zimbabwe, and Peru. The questions, each verified by five independent annotators, span 17 diverse topics ranging from food preferences to greeting etiquette. We evaluate models on two setups, CulturalBench-Easy and CulturalBench-Hard, which share the same questions but pose them differently. We find that LLMs are sensitive to this difference in setup (e.g., a 27.3% accuracy gap for GPT-4o). Compared to human performance (92.6% accuracy), CulturalBench-Hard is more challenging for frontier LLMs: the best-performing model (GPT-4o) reaches only 61.5% accuracy and the worst (Llama3-8b) only 21.4%. Moreover, we find that LLMs often struggle with tricky questions that have multiple correct answers (e.g., What utensils do the Chinese usually use?), revealing a tendency to converge on a single answer. Our results also indicate that OpenAI GPT-4o substantially outperforms other proprietary and open-source models on questions related to all regions but one (Oceania). Nonetheless, all models consistently underperform on questions related to South America and the Middle East.
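The abstract does not spell out how the Easy and Hard setups differ, only that they share the same questions. A minimal sketch of the general idea, assuming (hypothetically) a multiple-choice presentation for one setup and a per-option true/false presentation for the other, and a plain accuracy metric to expose a setup-sensitivity gap; the prompt formats, grading, and the example question are illustrative assumptions, not the paper's exact protocol:

```python
# Hypothetical illustration: pose the same question under two setups
# and compare accuracy. Formats here are assumptions for clarity only.

def format_easy(question, options):
    """One multiple-choice prompt: pick one labeled option."""
    labels = ["A", "B", "C", "D"]
    lines = [question] + [f"{l}. {o}" for l, o in zip(labels, options)]
    return "\n".join(lines)

def format_hard(question, option):
    """One binary prompt per candidate answer: judge it True/False."""
    return f"{question}\nCandidate answer: {option}\nTrue or False?"

def accuracy(preds, golds):
    """Fraction of predictions that match the gold labels."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

q = "What utensils do the Chinese usually use?"
opts = ["Chopsticks", "Spoons", "Forks", "Knives"]
print(format_easy(q, opts))
print(format_hard(q, opts[0]))

# A setup-sensitivity gap (like the reported 27.3% for GPT-4o) would
# surface as a difference between the two accuracies on shared items:
easy_acc = accuracy(["A", "B", "A"], ["A", "B", "C"])          # 2/3
hard_acc = accuracy([True, False, False], [True, True, True])  # 1/3
print(f"setup gap: {easy_acc - hard_acc:.3f}")
```

Questions with multiple correct answers (such as the utensils example) are exactly where a per-option binary setup can punish a model that converges on a single answer, since every true option must be judged correctly.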