Social agents with finitely nested opponent models are vulnerable to manipulation by agents with deeper reasoning and more sophisticated opponent modelling. This imbalance, rooted in logic and the theory of recursive modelling frameworks, cannot be solved directly. We propose a computational framework, $\aleph$-IPOMDP, that augments the Bayesian inference of model-based RL agents with an anomaly detection algorithm and an out-of-belief policy. This mechanism allows agents to realize that they are being deceived, even if they cannot understand how, and to deter opponents via a credible threat. We test the framework in both a mixed-motive game and a zero-sum game. Our results show that the $\aleph$ mechanism is effective, leading to more equitable outcomes and less exploitation by more sophisticated agents. We discuss implications for AI safety, cybersecurity, cognitive science, and psychiatry.