The Specification Trap: Why Static Value Alignment Alone Cannot Produce Robust Alignment

From arXiv, 24 pages. First in a six-paper program on AI alignment. Establishes a structural ceiling on closed specification (RLHF, Constitutional AI, IRL, assistance games) and claims that robust alignment under scaling, shift, and autonomy requires open, process-coupled specification. v3: thesis sharpened to closure; tool/autonomous distinction added; empirical signatures for open specification; six-paper structure.

Static content-based AI value alignment cannot produce robust alignment under capability scaling, distributional shift, and increasing autonomy. This holds for any approach that treats alignment as optimizing toward a fixed formal value-object, whether a reward function, a utility function, constitutional principles, or a learned preference representation. The limitation arises from three philosophical results: Hume's is-ought gap (behavioral data cannot entail normative conclusions), Berlin's value pluralism (human values are irreducibly plural and incommensurable), and the extended frame problem (any value encoding will misfit future contexts that advanced AI creates). RLHF, Constitutional AI, inverse reinforcement learning, and cooperative assistance games each instantiate this specification trap, and their failure modes are structural, not engineering limitations. Two proposed escape routes (meta-preferences and moral realism) relocate the trap rather than exit it. Continual updating represents a genuine direction of escape, not because current implementations succeed, but because the trap activates at the point of closure: the moment a specification ceases to update from the process it governs. Drawing on Fischer and Ravizza's compatibilist theory, the paper argues that behavioral compliance does not constitute alignment: there is a principled distinction between simulated value-following and genuine reasons-responsiveness, and closed specification methods cannot produce the latter. The specification trap establishes a ceiling on static approaches, not on specification itself, but this ceiling becomes safety-critical at the capability frontier. The alignment problem must be reframed from static value specification to open specification: systems whose value representations remain responsive to the processes they govern.
