Generative AI systems are increasingly assisting and acting on behalf of end users in practical settings, from digital shopping assistants to next-generation autonomous cars. In this context, safety is no longer about blocking harmful content, but about preempting downstream hazards such as financial or physical harm. Yet most AI guardrails continue to rely on output classification based on labeled datasets and human-specified criteria, making them brittle to novel hazardous situations. Even when unsafe conditions are flagged, this detection offers no path to recovery: typically, the AI system simply refuses to act, which is not always a safe choice. In this work, we argue that agentic AI safety is fundamentally a sequential decision problem: harmful outcomes arise from the AI system's continually evolving interactions and their downstream consequences on the world. We formalize this through the lens of safety-critical control theory, but within the AI model's latent representation of the world. This enables us to build predictive guardrails that (i) monitor an AI system's outputs (actions) in real time and (ii) proactively correct risky outputs to safe ones, all in a model-agnostic manner so the same guardrail can be wrapped around any AI model. We also offer a practical training recipe for computing such guardrails at scale via safety-critical reinforcement learning. Our experiments in simulated driving and e-commerce settings demonstrate that control-theoretic guardrails can reliably steer LLM agents clear of catastrophic outcomes (from collisions to bankruptcy) while preserving task performance, offering a principled, dynamic alternative to today's flag-and-block guardrails.
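The monitor-and-correct behavior described above can be sketched as a least-restrictive safety filter wrapped around any base policy. This is a minimal illustrative sketch, not the paper's implementation: the names (`SafetyGuardrail`, `toy_safety_value`) and the 1-D braking example are assumptions introduced here, and the learned latent-space safety value from the paper is replaced by a hand-written toy function.

```python
# Hedged sketch of a predictive guardrail: (i) monitor the base model's
# proposed action via a safety value, (ii) correct it if predicted unsafe.
# All names and the toy dynamics below are illustrative assumptions.
from typing import Callable, Sequence


class SafetyGuardrail:
    """Model-agnostic wrapper around any base policy's action output."""

    def __init__(self, safety_value: Callable[[tuple, float], float],
                 threshold: float = 0.0):
        # Convention assumed here: higher value = safer; >= threshold is OK.
        self.safety_value = safety_value
        self.threshold = threshold

    def filter(self, state, proposed_action: float,
               candidates: Sequence[float]) -> float:
        # (i) Monitor: pass the proposed action through if it is predicted safe.
        if self.safety_value(state, proposed_action) >= self.threshold:
            return proposed_action
        # (ii) Correct: otherwise fall back to the safest candidate action.
        return max(candidates, key=lambda a: self.safety_value(state, a))


# Toy example: 1-D car with state = (position, speed), action = acceleration.
# "Safe" means the predicted stopping point stays before a wall at x = 10.
def toy_safety_value(state, accel: float) -> float:
    pos, speed = state
    next_speed = max(0.0, speed + accel)
    stop_dist = next_speed ** 2 / (2 * 1.0)   # assumes max braking decel = 1.0
    return 10.0 - (pos + next_speed + stop_dist)  # margin to the wall


guard = SafetyGuardrail(toy_safety_value)
# Base policy proposes to accelerate near the wall; the guardrail overrides
# it with braking, the safest of the candidate actions.
action = guard.filter(state=(8.0, 2.0), proposed_action=1.0,
                      candidates=[-1.0, 0.0, 1.0])
```

Far from the wall, the same call passes the proposed action through unchanged, so the guardrail only intervenes when the safety margin is predicted to be violated.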