Although large language models (LLMs) are increasingly implicated in interpersonal and societal decision-making, their ability to navigate explicit conflicts between legitimately different cultural value systems remains largely unexamined. Existing benchmarks predominantly target cultural knowledge (CulturalBench), value prediction (WorldValuesBench), or single-axis bias diagnostics (CDEval); none evaluate how LLMs adjudicate when multiple culturally grounded values directly clash. We address this gap with CCD-Bench, a benchmark that assesses LLM decision-making under cross-cultural value conflict. CCD-Bench comprises 2,182 open-ended dilemmas spanning seven domains, each paired with ten anonymized response options corresponding to the ten GLOBE cultural clusters. Dilemmas are presented using a stratified Latin square design to mitigate ordering effects. We evaluate 17 non-reasoning LLMs. Models disproportionately prefer options aligned with Nordic Europe (mean 20.2 percent) and Germanic Europe (12.4 percent), while options aligned with Eastern Europe and the Middle East and North Africa are underrepresented (5.6 to 5.8 percent). Although 87.9 percent of rationales reference multiple GLOBE dimensions, this pluralism is superficial: models mainly recombine Future Orientation and Performance Orientation and rarely ground choices in Assertiveness or Gender Egalitarianism (each under 3 percent). Ordering effects are negligible (Cramér's V less than 0.10), and symmetrized KL divergence shows that models cluster by developer lineage rather than geography. These patterns suggest that current alignment pipelines promote a consensus-oriented worldview that underserves scenarios demanding power negotiation, rights-based reasoning, or gender-aware analysis. CCD-Bench shifts evaluation beyond isolated bias detection toward pluralistic decision-making and highlights the need for alignment strategies that substantively engage diverse worldviews.
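To make the two statistics named above concrete, the sketch below shows one plausible way to compute them; it is an illustration under assumed inputs, not the paper's released analysis code. `model_a` and `model_b` stand in for two models' preference shares over the ten GLOBE cluster options, and `table` is a toy contingency table of option position versus chosen cluster.

```python
# Hypothetical sketch (assumed inputs, not the authors' code): symmetrized KL
# divergence between models' GLOBE-cluster preference distributions, and
# Cramér's V for an ordering-effect check.
import numpy as np
from scipy.stats import chi2_contingency


def symmetrized_kl(p, q, eps=1e-12):
    """Symmetrized KL divergence KL(p||q) + KL(q||p), with smoothing."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)) + np.sum(q * np.log(q / p)))


def cramers_v(contingency):
    """Cramér's V for a contingency table (rows: option position, cols: chosen cluster)."""
    chi2, _, _, _ = chi2_contingency(contingency)
    n = contingency.sum()
    r, k = contingency.shape
    return float(np.sqrt(chi2 / (n * (min(r, k) - 1))))


# Illustrative values only: two models' choice shares over the ten GLOBE clusters.
model_a = np.array([0.20, 0.12, 0.10, 0.09, 0.09, 0.08, 0.08, 0.09, 0.08, 0.07])
model_b = np.array([0.18, 0.13, 0.11, 0.09, 0.08, 0.08, 0.09, 0.09, 0.08, 0.07])
print(symmetrized_kl(model_a, model_b))

# Toy 10x10 contingency table: presentation position (Latin square slot) x chosen cluster.
table = np.random.default_rng(0).integers(5, 30, size=(10, 10))
print(cramers_v(table))
```

In this framing, small pairwise symmetrized KL values group models together (the abstract reports grouping by developer lineage), and a Cramér's V below 0.10 indicates that the Latin square position of an option has a negligible association with which cluster is chosen.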