Social agents with finitely nested opponent models are vulnerable to manipulation by agents with deeper recursive capabilities. This imbalance, rooted in logic and the theory of recursive modelling frameworks, cannot be solved directly. We propose a computational framework called $\aleph$-IPOMDP, which augments the Bayesian inference of model-based RL agents with an anomaly detection algorithm and an out-of-belief policy. Our mechanism allows agents to realise that they are being deceived, even if they cannot understand how, and to deter opponents via a credible threat. We test this framework in both a mixed-motive and a zero-sum game. Our results demonstrate the $\aleph$-mechanism's effectiveness, leading to more equitable outcomes and less exploitation by more sophisticated agents. We discuss implications for AI safety, cybersecurity, cognitive science, and psychiatry.
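The core mechanism can be illustrated with a minimal sketch: a Bayesian agent maintains beliefs over a finite set of opponent models, accumulates a surprise score when observations are poorly explained by *all* of its models, and switches to a deterrence policy once that score crosses a threshold. Everything here is an assumption for illustration, not the paper's actual implementation: the class name `AlephAgent`, the cumulative-surprise detector, the uniform baseline, and the `"punish"`/`"cooperate"` actions are all hypothetical.

```python
import math

class AlephAgent:
    """Illustrative sketch (not the paper's algorithm): a Bayesian agent
    over a finite set of opponent models, with an anomaly detector that
    triggers an out-of-belief (deterrence) policy."""

    def __init__(self, models, surprise_threshold=5.0):
        self.models = models                       # list of P(obs | model) functions
        self.beliefs = [1.0 / len(models)] * len(models)
        self.cum_surprise = 0.0                    # accumulated unexplained surprise
        self.threshold = surprise_threshold        # hypothetical tuning parameter
        self.out_of_belief = False

    def observe(self, obs):
        likelihoods = [m(obs) for m in self.models]
        marginal = sum(b * l for b, l in zip(self.beliefs, likelihoods))
        # Surprise in excess of a uniform baseline over binary observations;
        # if no model in the finite set explains the data, this keeps growing.
        self.cum_surprise += max(0.0, -math.log(max(marginal, 1e-12)) - math.log(2))
        if self.cum_surprise > self.threshold:
            self.out_of_belief = True              # "I am being deceived, though I cannot model how"
        if marginal > 0:
            # Standard Bayesian belief update over the opponent models
            self.beliefs = [b * l / marginal for b, l in zip(self.beliefs, likelihoods)]

    def act(self):
        if self.out_of_belief:
            return "punish"      # credible threat: out-of-belief deterrence policy
        return "cooperate"       # best response under current beliefs
```

Against an opponent whose behaviour one of the models explains, the surprise score stays low and the agent keeps cooperating; against a stream engineered so that every model is repeatedly wrong, the score accumulates until the agent switches to the deterrence policy, even though it never identifies the deeper model generating the deception.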