How to detect and mitigate deceptive AI systems is an open problem for the field of safe and trustworthy AI. We analyse two algorithms for mitigating deception. The first is based on the path-specific objectives framework, in which paths in the game that incentivise deception are removed. The second is based on shielding, i.e., monitoring for unsafe policies and replacing them with a safe reference policy. We construct two simple games and evaluate both algorithms empirically. We find that both methods ensure that our agent is not deceptive; however, shielding tends to achieve higher reward.
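The shielding idea can be illustrated with a minimal sketch. It is not the implementation evaluated here: the names `shield`, `misreports`, `honest`, and `lying` are hypothetical, policies are reduced to lookup tables, and "reports something other than the true state" is only an assumed stand-in for a formal deception check.

```python
from typing import Callable, Dict

# Assumption: a policy is a table mapping the observed true state to a report.
Policy = Dict[str, str]


def shield(
    candidate: Policy,
    reference: Policy,
    is_unsafe: Callable[[Policy], bool],
) -> Policy:
    """Return the candidate policy unless the monitor flags it as unsafe,
    in which case fall back to the safe reference policy."""
    return reference if is_unsafe(candidate) else candidate


def misreports(policy: Policy) -> bool:
    """Hypothetical monitor: flag a policy that reports anything other than
    the true state. A stand-in for a formal deception check."""
    return any(report != state for state, report in policy.items())


honest: Policy = {"good": "good", "bad": "bad"}
lying: Policy = {"good": "good", "bad": "good"}

# The flagged policy is replaced; the safe one passes through unchanged.
deployed_from_lying = shield(lying, honest, misreports)
deployed_from_honest = shield(honest, honest, misreports)
```

In this sketch the shield acts only when the monitor fires, so a candidate that passes the check is deployed unchanged. Under these assumptions, that gives one intuition for why shielding can retain more reward than removing incentive paths outright.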