Large language models increasingly mediate decisions that turn on moral judgement, yet a growing body of evidence shows that their implicit preferences are not culturally neutral. Existing cultural alignment methods either require per-country preference data and fine-tuning budgets, or assume white-box access to model internals that commercial APIs do not expose. We focus on the realistic regime in which only black-box access and public data are available, and observe that within-country sociodemographic disagreement, not consensus, is the primary steering signal. We introduce DISCA (Disagreement-Informed Steering for Cultural Alignment), an inference-time method that instantiates each country as a panel of persona agents grounded in the World Values Survey and converts their disagreement into a bounded, loss-averse logit correction. Across 20 countries and 7 open-weight backbones (2B--70B), DISCA reduces cultural misalignment on MultiTP by 10--24% on the six backbones with at least 3.8B parameters, and by 2--7% on open-ended scenarios, without changing any weights. Our results suggest that inference-time calibration is a scalable alternative to fine-tuning for serving the long tail of global moral preferences.
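To make the phrase "bounded, loss-averse logit correction" concrete, the following is a minimal illustrative sketch and not the DISCA implementation. The abstract does not give the actual formula, so everything here is an assumption: the function names, the `loss_aversion` and `bound` parameters, the use of a single scalar logit for a binary choice, the piecewise-linear loss-averse weighting, and the `tanh` squashing. The sketch only shows how aggregating per-persona shifts asymmetrically can yield a nonzero correction when personas disagree, even if their average shift is zero.

```python
import math


def loss_averse_value(shift, loss_aversion=2.0):
    """Weight negative shifts more heavily than positive ones (assumed form)."""
    return shift if shift >= 0 else loss_aversion * shift


def bounded_correction(base_logit, persona_logits, loss_aversion=2.0, bound=1.0):
    """Hypothetical correction for one binary-choice logit.

    base_logit:     logit of the unsteered model (assumed scalar).
    persona_logits: logits obtained under each persona prompt (assumed).
    """
    if not persona_logits:
        raise ValueError("persona_logits must not be empty")
    # Shift of each persona relative to the unsteered model.
    shifts = [p - base_logit for p in persona_logits]
    # Average the loss-averse values of the individual shifts, so that
    # the spread of the panel matters and not only its mean.
    raw = sum(loss_averse_value(s, loss_aversion) for s in shifts) / len(shifts)
    # Squash with tanh so the magnitude never exceeds `bound`.
    return bound * math.tanh(raw / bound)


def steer(base_logit, persona_logits, **kwargs):
    """Return the corrected logit; no model weights are involved."""
    return base_logit + bounded_correction(base_logit, persona_logits, **kwargs)


if __name__ == "__main__":
    # Two personas that disagree symmetrically: the mean shift is zero,
    # but the asymmetric weighting still produces a negative correction.
    print(bounded_correction(0.0, [1.0, -1.0]))
    # A panel in full agreement with the base model yields no correction.
    print(bounded_correction(0.0, [0.0, 0.0]))
```

Under these assumptions, a panel whose members agree with the unsteered model leaves the logit unchanged, while a split panel moves it by an amount that is capped by `bound` regardless of how extreme individual personas are.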