Existing work on value alignment typically characterizes value relations statically, ignoring how interventions (such as prompting, fine-tuning, or preference optimization) reshape the broader value system. We introduce the Value Alignment Tax (VAT), a framework that measures how alignment-induced changes propagate across interconnected values, relative to the on-target gain achieved. VAT thus captures the dynamics of value expression under alignment pressure. Using a controlled scenario-action dataset grounded in Schwartz value theory, we collect paired pre- and post-intervention normative judgments and analyze alignment effects across models, values, and alignment strategies. Our results show that alignment often produces uneven, structured co-movement among values. These effects are invisible under conventional target-only evaluation; they reveal systemic, process-level alignment risks and offer new insights into the dynamics of value alignment in LLMs.