Generative AI systems are increasingly assisting and acting on behalf of end users in practical settings, from digital shopping assistants to next-generation autonomous cars. In this context, safety is no longer about blocking harmful content, but about preempting downstream hazards such as financial or physical harm. Yet most AI guardrails continue to rely on output classification based on labeled datasets and human-specified criteria, making them brittle to novel hazardous situations. Even when unsafe conditions are flagged, this detection offers no path to recovery: typically, the AI system simply refuses to act, which is not always a safe choice. In this work, we argue that agentic AI safety is fundamentally a sequential decision problem: harmful outcomes arise from the AI system's continually evolving interactions and their downstream consequences on the world. We formalize this through the lens of safety-critical control theory, but within the AI model's latent representation of the world. This enables us to build predictive guardrails that (i) monitor an AI system's outputs (actions) in real time and (ii) proactively correct risky outputs to safe ones, all in a model-agnostic manner so the same guardrail can be wrapped around any AI model. We also offer a practical training recipe for computing such guardrails at scale via safety-critical reinforcement learning. Our experiments in simulated driving and e-commerce settings demonstrate that control-theoretic guardrails can reliably steer LLM agents clear of catastrophic outcomes (from collisions to bankruptcy) while preserving task performance, offering a principled, dynamic alternative to today's flag-and-block guardrails.
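The monitor-and-correct behavior described above can be sketched as a least-restrictive safety filter wrapped around any base policy. This is a minimal illustrative sketch, not the paper's implementation: the names (`SafetyGuardrail`, `toy_safety_value`) and the 1-D braking example are assumptions introduced here, and the learned latent-space safety value from the paper is replaced by a hand-written toy function.

```python
# Hedged sketch of a predictive guardrail: (i) monitor the base model's
# proposed action via a safety value, (ii) correct it if predicted unsafe.
# All names and the toy dynamics below are illustrative assumptions.
from typing import Callable, Sequence


class SafetyGuardrail:
    """Model-agnostic wrapper around any base policy's action output."""

    def __init__(self, safety_value: Callable[[tuple, float], float],
                 threshold: float = 0.0):
        # Convention assumed here: higher value = safer; >= threshold is OK.
        self.safety_value = safety_value
        self.threshold = threshold

    def filter(self, state, proposed_action: float,
               candidates: Sequence[float]) -> float:
        # (i) Monitor: pass the proposed action through if it is predicted safe.
        if self.safety_value(state, proposed_action) >= self.threshold:
            return proposed_action
        # (ii) Correct: otherwise fall back to the safest candidate action.
        return max(candidates, key=lambda a: self.safety_value(state, a))


# Toy example: 1-D car with state = (position, speed), action = acceleration.
# "Safe" means the predicted stopping point stays before a wall at x = 10.
def toy_safety_value(state, accel: float) -> float:
    pos, speed = state
    next_speed = max(0.0, speed + accel)
    stop_dist = next_speed ** 2 / (2 * 1.0)   # assumes max braking decel = 1.0
    return 10.0 - (pos + next_speed + stop_dist)  # margin to the wall


guard = SafetyGuardrail(toy_safety_value)
# Base policy proposes to accelerate near the wall; the guardrail overrides
# it with braking, the safest of the candidate actions.
action = guard.filter(state=(8.0, 2.0), proposed_action=1.0,
                      candidates=[-1.0, 0.0, 1.0])
```

Far from the wall, the same call passes the proposed action through unchanged, so the guardrail only intervenes when the safety margin is predicted to be violated.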