Recent empirical results have demonstrated that training large language models (LLMs) with negative-only feedback can match or exceed standard reinforcement learning from human feedback (RLHF). Negative Sample Reinforcement achieves parity with PPO on mathematical reasoning; Distributional Dispreference Optimization trains effectively using only dispreferred samples; and Constitutional AI outperforms pure RLHF on harmlessness benchmarks. Yet no unified theoretical account explains why negative signals are so effective. This paper proposes such an account: positive preferences and negative constraints are structurally asymmetric. Positive preferences ("which is better") encode continuously coupled, context-dependent human values that cannot be exhaustively specified -- leading models to learn surface correlates such as agreement with the user (sycophancy). Negative constraints ("what is wrong") encode discrete, finite, independently verifiable prohibitions that can converge to a stable boundary. This asymmetry -- rooted in Popper's falsification logic and the epistemology of negative knowledge -- explains both the sycophancy failure of preference-based RLHF and the surprising effectiveness of negative-signal methods. We argue that alignment research should shift its center of gravity from "learning what humans prefer" to "learning what humans reject," and offer testable predictions for this framework.