Constitutional AI (CAI) aligns language models with explicitly stated normative principles, offering a transparent alternative to implicit alignment through human feedback alone. However, because constitutions are authored by specific groups of people, the resulting models may reflect particular cultural perspectives. We investigate this question by evaluating Anthropic's Claude Sonnet on 55 World Values Survey items, selected for high cross-cultural variance across six value domains and administered as both direct survey questions and naturalistic advice-seeking scenarios. Comparing Claude's responses to country-level data from 90 nations, we find that Claude's value profile most closely resembles those of Northern European and Anglophone countries, but on a majority of items extends beyond the range of all surveyed populations. When users provide cultural context, Claude adjusts its rhetorical framing but not its substantive value positions, with effect sizes indistinguishable from zero across all twelve tested countries. An ablation removing the system prompt increases refusals but does not alter the values expressed when responses are given, and replication on a smaller model (Claude Haiku) confirms the same cultural profile across model sizes. These findings suggest that when a constitution is authored within the same cultural tradition that dominates the training data, constitutional alignment may codify existing cultural biases rather than correct them--producing a value floor that surface-level interventions cannot meaningfully shift. We discuss the compounding nature of this risk and the need for globally representative constitution-authoring processes.
翻译:宪制人工智能(CAI)通过明确陈述的规范性原则来对齐语言模型,相比仅依赖人类反馈的隐式对齐方法,提供了透明化的替代方案。然而,由于宪法由特定群体制定,由此产生的模型可能反映特定文化视角。我们通过使用55个世界价值观调查项目评估Anthropic公司的克劳德·桑内特模型来研究这一问题,这些项目涵盖六个价值维度中具有高度跨文化差异的条目,并以直接问卷调查和自然主义咨询场景两种形式实施。将克劳德模型的响应与90个国家层面的数据进行比较,我们发现其价值分布最接近北欧和英语国家,但在多数条目上超出所有被调查人群的范围。当用户提供文化背景信息时,克劳德模型会调整其修辞框架,但不会改变实质性价值立场,在全部12个测试国家中效应量均趋近于零。移除系统提示的消融实验增加了模型拒绝回答的概率,但未改变实际产生响应时表达的价值取向;在较小模型(克劳德·俳句)上的重复实验验证了跨模型规模的同一文化特征。这些发现表明,当宪法制定者与主导训练数据的文化传统相同时,宪制对齐可能固化而非矫正现有文化偏见——形成表面干预无法实质性改变的价值底线。我们讨论了这种风险的叠加效应以及建立全球代表性宪法制定流程的必要性。