The Specification Trap: Why Static Value Alignment Alone Is Insufficient for Robust Alignment

Static, content-based AI value alignment is insufficient for robust alignment under capability scaling, distributional shift, and increasing autonomy. This holds for any approach that treats alignment as optimizing toward a fixed formal value-object, whether a reward function, a utility function, a set of constitutional principles, or a learned preference representation. Three philosophical results create compounding difficulties: Hume's is-ought gap (behavioral data underdetermines normative content), Berlin's value pluralism (human values resist consistent formalization), and the extended frame problem (any value encoding will misfit future contexts that advanced AI creates). RLHF, Constitutional AI, inverse reinforcement learning, and cooperative assistance games each instantiate this specification trap, and their failure modes reflect structural vulnerabilities, not merely engineering limitations that better data or algorithms will straightforwardly resolve. Known workarounds for individual components face mutually reinforcing difficulties when the specification is closed, that is, once it ceases to update from the process it governs. Drawing on compatibilist philosophy, the paper argues that behavioral compliance under training conditions does not guarantee robust alignment under novel conditions, and that this gap grows with system capability. For value-laden autonomous systems, known closed approaches therefore face structural vulnerabilities that worsen with capability. The constructive burden shifts to open, developmentally responsive approaches, though whether such approaches can be achieved remains an empirical question.
