Frontier large language models (LLMs) are developed by researchers and practitioners with skewed cultural backgrounds and trained on datasets drawn from skewed sources. However, LLMs' (lack of) multicultural knowledge cannot be effectively assessed with current methods for developing benchmarks. Existing multicultural evaluations rely primarily on expensive and restricted human annotations or on potentially outdated internet resources. As a result, they struggle to capture the intricacy, dynamics, and diversity of cultural norms. LLM-generated benchmarks are promising, yet they risk propagating the same biases they are meant to measure. To combine the creativity and expert cultural knowledge of human annotators with the scalability and standardizability of LLM-based automation, we introduce CulturalTeaming, an interactive red-teaming system that leverages human-AI collaboration to build a truly challenging evaluation dataset for assessing the multicultural knowledge of LLMs, while improving annotators' capabilities and experiences. Our study reveals that CulturalTeaming's various modes of AI assistance support annotators in creating, in a gamified manner, cultural questions that modern LLMs fail to answer. Importantly, a higher level of AI assistance (e.g., LLM-generated revision hints) empowers users to create more difficult questions and increases their perception of their own creativity, shedding light on the promise of involving heavier AI assistance in modern evaluation dataset creation procedures. Through a series of 1-hour workshop sessions, we gather CULTURALBENCH-V0.1, a compact yet high-quality evaluation dataset built from users' red-teaming attempts, on which different families of modern LLMs achieve accuracies ranging from 37.7% to 72.2%, revealing a notable gap in LLMs' multicultural proficiency.