AI alignment is growing in importance, yet current approaches suffer from a critical structural flaw: they entangle safety objectives with the agent's policy. Methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO) produce opaque, single-use alignment artifacts, which we term Alignment Waste. We propose Interactionless Inverse Reinforcement Learning to decouple the learning of the alignment artifact from policy optimization, producing an inspectable, editable, and model-agnostic reward model. We further introduce the Alignment Flywheel, a human-in-the-loop lifecycle that iteratively hardens the reward model through automated audits and refinement. Together, these components transform safety from a disposable expense into a durable, verifiable engineering asset.
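The decoupling idea in the abstract can be sketched concretely. In the toy example below, the reward model is a standalone, transparent artifact (a plain weight vector over features) rather than preference signal fused into a policy's weights, so it can be audited, edited by hand, and handed to any downstream policy optimizer. All names here (`LinearRewardModel`, `fit_from_demonstrations`) are illustrative assumptions, not the paper's actual method or API; the fit step is a simple feature-matching heuristic standing in for a real IRL procedure.

```python
# Hypothetical sketch: an inspectable, editable, model-agnostic reward
# model learned separately from any policy. Names and the fitting rule
# are illustrative, not the paper's actual algorithm.
import numpy as np

class LinearRewardModel:
    """A transparent reward model: r(s) = w . phi(s)."""
    def __init__(self, n_features):
        self.w = np.zeros(n_features)

    def fit_from_demonstrations(self, demo_features, baseline_features):
        # Toy feature-matching update: weight each feature by how much
        # expert behavior exceeds baseline behavior on it.
        self.w = demo_features.mean(axis=0) - baseline_features.mean(axis=0)
        return self

    def reward(self, features):
        return features @ self.w

rng = np.random.default_rng(0)
demos = rng.normal(1.0, 0.1, size=(32, 4))      # expert-like feature vectors
baseline = rng.normal(0.0, 0.1, size=(32, 4))   # random-behavior features

rm = LinearRewardModel(4).fit_from_demonstrations(demos, baseline)

# Because the artifact is just a weight vector, it can be audited and
# edited directly -- e.g., zeroing a feature judged unsafe -- then
# reused unchanged with any policy optimizer.
rm.w[3] = 0.0
```

The point of the sketch is the lifecycle, not the math: the audit-and-edit step on the last line is what an opaque, policy-entangled artifact cannot support.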