Large Language Models (LLMs) are deployed across cultural contexts but often reflect homogenized values inherited from their training data. Evaluations of cultural alignment typically rely on direct prompting with survey-style questions, which frequently elicit neutral or safety-aligned responses and fail to capture underlying model preferences. We propose a framework for probing and steering latent cultural representations in LLMs along the two Inglehart--Welzel axes of the World Values Survey (WVS). By translating social value questions into scenario-based behavioral dilemmas, we extract token-level probabilities to measure implicit values and apply activation steering, optionally combined with country-conditioned prompting, to shift model behavior without retraining. Across three open-source LLMs and four target cultures, we find substantial variation in steerability and identify latent entanglement, in which interventions along one cultural dimension induce shifts along the other. This coupling mirrors correlations in human WVS data and persists across activation-based, prompt-based, and hybrid steering. The entanglement constrains axis-independent alignment, although general task performance is largely preserved.