Foundation models are increasingly deployed in socially sensitive domains such as education, mental health, and caregiving, where failures are often cumulative and context-dependent. Existing guardrail approaches (ranging from training-time alignment to prompting, decoding constraints, and post-hoc moderation) primarily provide empirical risk reduction rather than enforceable behavioral guarantees, and largely treat safety as a property of individual outputs rather than interaction trajectories. We reframe guardrails as a problem of runtime behavioral control over interaction trajectories, drawing on robotics to introduce formal constructs for constraint enforcement in uncertain, closed-loop systems. We instantiate these ideas in the Grounded Observer framework and apply it across three real-world deployments: small talk, in-home autism therapy, and behavioral de-escalation in schools. Across settings, the framework enables runtime interventions that mitigate drift into undesirable interaction regimes while adapting to diverse social contexts. We discuss extensions to the framework and propose research directions toward stronger guarantees.