I argue that content-based AI value alignment--any approach that treats alignment as optimizing toward a formal value-object (a reward function, utility function, constitutional principles, or learned preference representation)--cannot, by itself, produce robust alignment under capability scaling, distributional shift, and increasing autonomy. This limitation arises from three philosophical results: Hume's is-ought gap (behavioral data cannot entail normative conclusions), Berlin's value pluralism (human values are irreducibly plural and incommensurable), and the extended frame problem (any value encoding will misfit the future contexts that advanced AI itself creates). I show that RLHF, Constitutional AI, inverse reinforcement learning, and cooperative assistance games each instantiate this specification trap, and that their failure modes are structural rather than engineering limitations. Proposed escape routes--continual updating, meta-preferences, moral realism--relocate the trap rather than exit it. Drawing on Fischer and Ravizza's compatibilist theory, I argue that behavioral compliance does not constitute alignment: there is a principled distinction between simulated value-following and genuine reasons-responsiveness, and specification-based methods cannot produce the latter. The specification trap thus establishes a ceiling on content-based approaches rather than rendering them useless--but that ceiling becomes safety-critical at the capability frontier. The alignment problem must therefore be reframed from value specification to value emergence.