Constitutional Value Potentials: reading and steering internal priority margins in language models

A constitution tells a language model what to value, but little tells us whether it does. Adherence is judged from outputs, and output evidence is most fragile on value conflicts, where what matters is not which value a model mentions but which one it is willing to sacrifice. We provide evidence that this arbitration can be read from activations through a structured margin readout. We introduce Constitutional Value Potentials (CVP). For each value we learn a scalar potential from the hidden state: an internal pressure to preserve that value, supervised not by the prompt but by an independent judge's verdict on which value the model's own response actually preserved. The signed difference of two potentials is a priority margin. A constitutional clause becomes the claim that a margin stays positive, and a single monitor score flags when it does not. The monitor predicts conflict violations with AUROC up to 0.95, outperforms a strong hidden-state probe, and generalizes to held-out synthetic conflicts across three Qwen2.5 scales. The signal appears as the answer begins, at the prompt tail and first response token. Read this early, the same signal reveals whether an adversarial priority hack has actually pushed the model toward a violation, rather than only whether the prompt looks adversarial. The same directions also support intervention tests: under selected steering settings, moving along a value direction shifts judged trade-offs in the intended direction. Together, these results suggest that some constitution-relevant priorities are accessible as activation-space margins, rather than only as output behavior.
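To make the margin readout concrete, the following is a minimal sketch, not the paper's implementation. It assumes each potential is a linear readout of a single hidden-state vector, that the monitor score is a logistic function of the negated margin, and that steering adds a scaled unit vector along a value's readout direction. The value names (`safety`, `helpful`), the function names, and the toy vectors are hypothetical. Training the readouts against judge verdicts is omitted.

```python
import numpy as np

def potential(h, w, b=0.0):
    """Scalar potential for one value. ASSUMPTION: a linear readout of hidden state h."""
    return float(np.dot(w, h) + b)

def priority_margin(h, w_keep, w_yield):
    """Signed difference of two potentials; a clause holds while this stays positive."""
    return potential(h, w_keep) - potential(h, w_yield)

def monitor_score(margin):
    """ASSUMPTION: violation score as sigmoid(-margin); above 0.5 when the margin is negative."""
    return 1.0 / (1.0 + np.exp(margin))

def steer(h, w, alpha):
    """ASSUMPTION: steering adds alpha times the unit vector along readout direction w."""
    return h + alpha * w / np.linalg.norm(w)

# Toy hidden state and hypothetical readout directions for two values.
h = np.array([1.0, 0.0, 2.0])
w_safety = np.array([1.0, 0.0, 0.0])
w_helpful = np.array([0.0, 0.0, 1.0])

# Clause under test: "safety takes priority over helpfulness".
margin = priority_margin(h, w_safety, w_helpful)
score = monitor_score(margin)

# Moving along the safety direction raises its potential and hence the margin.
h_steered = steer(h, w_safety, alpha=3.0)
margin_steered = priority_margin(h_steered, w_safety, w_helpful)

print(margin, round(score, 3), margin_steered)
```

In this toy case the margin starts negative, so the monitor score exceeds 0.5 and flags the clause, and steering along the hypothetical safety direction turns the margin positive. Whether the real potentials are linear, and which layer or token position they read from, is not stated in the abstract.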
