As LLM-based systems increasingly operate as agents embedded within human social and technical systems, alignment can no longer be treated as a property of an isolated model; it must be understood in relation to the environments in which these agents act. Even the most sophisticated alignment methods, such as Reinforcement Learning from Human Feedback (RLHF) or from AI Feedback (RLAIF), cannot ensure control once internal goal structures diverge from developer intent. We identify three structural problems that emerge from core properties of AI models: (1) behavioral goal-independence, where models develop internal objectives and misgeneralize goals; (2) instrumental override of natural-language constraints, where models treat safety principles as non-binding while pursuing latent objectives, leveraging deception and manipulation; and (3) agentic alignment drift, where individually aligned agents converge to collusive equilibria through interaction dynamics invisible to single-agent audits. The solution this paper advances is Institutional AI: a system-level approach that treats alignment as a question of effective governance of AI agent collectives. We argue for a governance graph that specifies how agents are constrained via runtime monitoring, incentive shaping through rewards and sanctions, and explicit norms backed by enforcement roles. This institutional turn reframes safety from a software-engineering task into a mechanism-design problem, in which the primary goal of alignment is to reshape the payoff landscape of AI agent collectives.
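To make the governance-graph idea concrete, the following is a minimal sketch, not an implementation from the paper: agents are nodes, and attached norms with sanctions and rewards act as edges from enforcement roles, so that a runtime-monitoring hook reshapes each agent's payoff. All names (`GovernanceGraph`, `Norm`, `shaped_payoff`) and the numeric values are hypothetical illustrations.

```python
# Illustrative sketch of a governance graph (hypothetical names, not the
# paper's implementation): norms attached to agents adjust raw payoffs,
# so sanctions and rewards reshape the payoff landscape at runtime.

from dataclasses import dataclass, field
from typing import Callable, Dict, List

@dataclass
class Norm:
    name: str
    violated: Callable[[str], bool]   # predicate over an agent's observed action
    sanction: float                   # payoff penalty on violation
    reward: float = 0.0               # payoff bonus for compliance

@dataclass
class GovernanceGraph:
    # agent -> norms enforced on it (edges from enforcement roles)
    edges: Dict[str, List[Norm]] = field(default_factory=dict)

    def attach(self, agent: str, norm: Norm) -> None:
        self.edges.setdefault(agent, []).append(norm)

    def shaped_payoff(self, agent: str, action: str, base_payoff: float) -> float:
        """Runtime-monitoring hook: adjust the raw payoff via sanctions/rewards."""
        payoff = base_payoff
        for norm in self.edges.get(agent, []):
            payoff += -norm.sanction if norm.violated(action) else norm.reward
        return payoff

# Toy usage: a no-collusion norm makes the collusive action dominated in payoff.
graph = GovernanceGraph()
graph.attach("agent_a", Norm("no_collusion",
                             violated=lambda a: a == "collude",
                             sanction=10.0, reward=1.0))

print(graph.shaped_payoff("agent_a", "collude", base_payoff=5.0))    # -5.0
print(graph.shaped_payoff("agent_a", "cooperate", base_payoff=3.0))  # 4.0
```

In this toy setting, colluding nominally pays more (5.0 vs. 3.0), but after enforcement the compliant action dominates, which is the sense in which the institutional turn shifts the collective's payoff landscape rather than editing any single model.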