An agent must try new behaviors to explore and improve. In high-stakes environments, an agent that violates safety constraints may cause harm and must be taken offline, curtailing any future interaction. Imitating old behavior is safe, but excessive conservatism discourages exploration. How much behavior change is too much? We show how to use any safe reference policy as a probabilistic regulator for any optimized but untested policy. Conformal calibration on data from the safe policy determines how aggressively the new policy can act, while provably enforcing the user's declared risk tolerance. Unlike conservative optimization methods, we do not assume the user has identified the correct model class or tuned any hyperparameters. Unlike previous conformal methods, our theory provides finite-sample guarantees even for non-monotonic bounded constraint functions. Our experiments on applications ranging from natural language question answering to biomolecular engineering show that safe exploration is not only possible from the first moment of deployment but can also improve performance.
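To make the calibration step concrete, here is a minimal sketch of split-conformal calibration under stated assumptions: a scalar constraint score where larger values mean greater risk, and a simple fallback gate to the safe policy. The names `constraint`, `new_policy`, and `safe_policy`, and the gate itself, are illustrative assumptions, not the paper's exact procedure. The rank ceil((n+1)(1-alpha)) gives the standard distribution-free, finite-sample guarantee under exchangeability of the calibration scores.

```python
import numpy as np

def conformal_threshold(scores, alpha):
    """Split-conformal quantile: given n exchangeable calibration scores,
    a fresh score from the same distribution exceeds the returned value
    with probability at most alpha (finite-sample, distribution-free)."""
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))  # conservative quantile rank
    if k > n:
        return np.inf  # too few samples to certify risk level alpha
    return np.sort(scores)[k - 1]

# Hypothetical setup: `constraint` maps a state-action pair to a scalar,
# with larger values meaning closer to a safety violation.
rng = np.random.default_rng(0)
safe_scores = rng.normal(size=500)  # stand-in for scores of safe-policy rollouts
tau = conformal_threshold(safe_scores, alpha=0.05)  # alpha = declared risk tolerance

def regulated_action(state, new_policy, safe_policy, constraint):
    """Let the optimized policy act only when its proposal scores no worse
    than the calibrated threshold; otherwise fall back to the safe policy."""
    action = new_policy(state)
    if constraint(state, action) <= tau:
        return action
    return safe_policy(state)
```

In this reading, the threshold tau is the "probabilistic regulator": a tighter risk tolerance alpha forces a higher-rank quantile, so the new policy is allowed to deviate from the reference less often.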