The Specification Trap: Why Static Value Alignment Alone Is Insufficient for Robust Alignment

Static, content-based AI value alignment is insufficient for robust alignment under capability scaling, distributional shift, and increasing autonomy. This holds for any approach that treats alignment as optimizing toward a fixed formal value-object, whether a reward function, a utility function, constitutional principles, or a learned preference representation. Three philosophical results create compounding difficulties: Hume's is-ought gap (behavioral data underdetermines normative content), Berlin's value pluralism (human values resist consistent formalization), and the extended frame problem (any value encoding will misfit future contexts that advanced AI creates). RLHF, Constitutional AI, inverse reinforcement learning, and cooperative assistance games each instantiate this specification trap, and their failure modes reflect structural vulnerabilities, not merely engineering limitations that better data or algorithms will straightforwardly resolve. Known workarounds for individual components face mutually reinforcing difficulties when the specification is closed, that is, from the moment it ceases to update from the process it governs. Drawing on compatibilist philosophy, the paper argues that behavioral compliance under training conditions does not guarantee robust alignment under novel conditions, and that this gap grows with system capability. For value-laden autonomous systems, known closed approaches face structural vulnerabilities that worsen with capability. The constructive burden shifts to open, developmentally responsive approaches, though whether such approaches can be achieved remains an empirical question.
