When AI agents fail to align their actions with human values, they may cause serious harm. One way to address the value alignment problem is to include a human operator who monitors all of the agent's actions. Although this solution guarantees maximal safety, it is very inefficient, since it requires the operator to dedicate their full attention to the agent. In this paper, we propose a much more efficient solution that allows the operator to engage in other activities without neglecting the monitoring task. In our approach, the AI agent requests permission from the operator only for critical actions, that is, potentially harmful actions. We introduce the concept of critical actions with respect to AI safety and discuss how to build a model that measures action criticality. We also discuss how the operator's feedback could be used to make the agent smarter.
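The permission-request idea can be illustrated with a minimal sketch. It is not the paper's implementation: the class `GatedAgent`, the `criticality` and `ask_operator` callables, and the `threshold` value of 0.5 are all hypothetical names and settings chosen for illustration. The sketch assumes a criticality model that returns a score in [0, 1] and an operator interface that returns a yes/no decision.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple


@dataclass
class GatedAgent:
    # Hypothetical criticality model: maps an action to a score in [0, 1].
    criticality: Callable[[str], float]
    # Hypothetical operator interface: returns True if the action is approved.
    ask_operator: Callable[[str], bool]
    # Assumed cut-off at or above which an action counts as critical.
    threshold: float = 0.5
    # Operator decisions, stored as (action, approved) pairs for later learning.
    feedback: List[Tuple[str, bool]] = field(default_factory=list)

    def decide(self, action: str) -> bool:
        """Return True if the action may be executed."""
        if self.criticality(action) < self.threshold:
            # Non-critical: proceed without interrupting the operator.
            return True
        # Critical: ask the operator and record the answer.
        approved = self.ask_operator(action)
        self.feedback.append((action, approved))
        return approved
```

In this sketch the operator is consulted only when the score reaches the threshold, and each decision is appended to `feedback`, which could later serve as training data for refining the criticality model. A cautious design would also have the criticality model assign a high score to actions it has never seen, so that unknown actions are routed to the operator by default.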