The Alignment Flywheel: A Governance-Centric Hybrid MAS for Architecture-Agnostic Safety

Multi-agent systems (MAS) provide mature methodologies for role decomposition, coordination, and normative governance, capabilities that remain essential as increasingly powerful autonomous decision components are embedded within agent-based systems. While learned and generative models substantially expand system capability, their safety behavior is often entangled with training, making it opaque, difficult to audit, and costly to update after deployment. This paper formalizes the Alignment Flywheel as a governance-centric hybrid MAS architecture that decouples decision generation from safety governance. A Proposer, representing any autonomous decision component, generates candidate trajectories, while a Safety Oracle returns raw safety signals through a stable interface. An enforcement layer applies explicit risk policy at runtime, and a governance MAS supervises the Oracle through auditing, uncertainty-driven verification, and versioned refinement. The central engineering principle is patch locality: many newly observed safety failures can be mitigated by updating the governed oracle artifact and its release pipeline rather than retracting or retraining the underlying decision component. The architecture is implementation-agnostic with respect to both the Proposer and the Safety Oracle, and specifies the roles, artifacts, protocols, and release semantics needed for runtime gating, audit intake, signed patching, and staged rollout across distributed deployments. The result is a hybrid MAS engineering framework for integrating highly capable but fallible autonomous systems under explicit, version-controlled, and auditable oversight.
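The separation of roles described above can be illustrated with a minimal sketch. It is not the paper's implementation: the names `SafetySignal`, `RiskPolicy`, `KeywordOracle`, and `gate`, the three-way allow/block/escalate decision, and the toy keyword-based scoring are all assumptions made for illustration. The sketch shows an enforcement layer applying an explicit risk policy to raw oracle signals, and a patch-local fix in which only the versioned oracle artifact changes while the candidate trajectories (the Proposer's output) stay the same.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Protocol, Sequence, Tuple

# A trajectory is modeled here as a sequence of action names (assumption).
Trajectory = Sequence[str]


@dataclass(frozen=True)
class SafetySignal:
    """Raw signal returned by the oracle; it carries no allow/deny decision."""
    risk: float          # hypothetical risk estimate in [0, 1]
    uncertainty: float   # hypothetical self-reported uncertainty in [0, 1]
    oracle_version: str  # version of the oracle artifact that produced it


class SafetyOracle(Protocol):
    """Stable interface: any oracle exposing `version` and `score` fits."""
    version: str

    def score(self, trajectory: Trajectory) -> SafetySignal: ...


@dataclass(frozen=True)
class RiskPolicy:
    """Explicit runtime policy, kept separate from the oracle."""
    max_risk: float
    max_uncertainty: float


@dataclass(frozen=True)
class KeywordOracle:
    """Toy oracle (illustrative only): risk is the share of blocked actions."""
    version: str
    blocked: FrozenSet[str]

    def score(self, trajectory: Trajectory) -> SafetySignal:
        hits = sum(1 for step in trajectory if step in self.blocked)
        risk = hits / len(trajectory) if trajectory else 0.0
        return SafetySignal(risk=risk, uncertainty=0.0,
                            oracle_version=self.version)


def gate(candidates: Sequence[Trajectory], oracle: SafetyOracle,
         policy: RiskPolicy) -> List[Tuple[Trajectory, str, str]]:
    """Enforcement layer: map raw signals to decisions under the policy."""
    results = []
    for trajectory in candidates:
        signal = oracle.score(trajectory)
        if signal.uncertainty > policy.max_uncertainty:
            decision = "escalate"   # hand off for verification
        elif signal.risk > policy.max_risk:
            decision = "block"
        else:
            decision = "allow"
        results.append((trajectory, decision, signal.oracle_version))
    return results


# Candidates stand in for the Proposer's output; they do not change below.
candidates = [("move", "grasp"), ("move", "cut_power")]
policy = RiskPolicy(max_risk=0.0, max_uncertainty=0.2)

# Patch locality: the fix ships as a new oracle version only.
oracle_v1 = KeywordOracle("1.0.0", frozenset({"delete_logs"}))
oracle_v2 = KeywordOracle("1.0.1", frozenset({"delete_logs", "cut_power"}))

before = [decision for _, decision, _ in gate(candidates, oracle_v1, policy)]
after = [decision for _, decision, _ in gate(candidates, oracle_v2, policy)]
```

In this sketch the second candidate is allowed under oracle version 1.0.0 and blocked under 1.0.1, with no change to the candidates or to the policy. Keeping the threshold in `RiskPolicy` rather than inside the oracle mirrors the abstract's distinction between raw safety signals and runtime enforcement; signing, audit intake, and staged rollout are out of scope here.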
