In this work, we ask, and answer, what makes classical temporal-difference reinforcement learning with epsilon-greedy strategies cooperative. Cooperation in social dilemmas is vital for animals, humans, and machines. While evolutionary theory has revealed a range of mechanisms that promote cooperation, the conditions under which agents learn to cooperate remain contested. Here, we demonstrate which individual elements of the multi-agent learning setting lead to cooperation, and how. We use the iterated Prisoner's Dilemma with one-period memory as a testbed: each of the two learning agents learns a strategy that conditions its next action on both agents' actions in the previous round. We find that, alongside a high weighting of future rewards, a low exploration rate, and a small learning rate, it is primarily the intrinsic stochastic fluctuations of the reinforcement learning process that double the final rate of cooperation, to up to 80%. Thus, inherent noise is not a necessary evil of the iterative learning process but a critical asset for learning to cooperate. However, we also point out a trade-off between achieving a high likelihood of cooperative behavior and doing so in a reasonable amount of time. Our findings are relevant for purposefully designing cooperative algorithms and for regulating undesired collusive effects.
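The setting described in the abstract can be illustrated with a minimal sketch: two tabular learners playing the iterated Prisoner's Dilemma, each conditioning on the joint action of the previous round and choosing epsilon-greedily. This is not the authors' implementation. Q-learning is used here as one common temporal-difference method, and the payoff values (5, 3, 1, 0), the parameter defaults, the zero initialization, and all function names are assumptions made for illustration.

```python
import random

# Actions: 0 = cooperate, 1 = defect.
C, D = 0, 1

# Assumed payoffs (temptation > reward > punishment > sucker); the abstract
# does not state them. Keyed by (own action, other's action) -> own reward.
PAYOFF = {(C, C): 3.0, (C, D): 0.0, (D, C): 5.0, (D, D): 1.0}


def choose(q, state, epsilon, rng):
    """Epsilon-greedy choice between the two actions in the given state."""
    if rng.random() < epsilon:
        return rng.choice((C, D))  # explore
    qc, qd = q[state][C], q[state][D]
    if qc == qd:
        return rng.choice((C, D))  # break ties at random
    return C if qc > qd else D


def run(steps=10000, alpha=0.05, gamma=0.9, epsilon=0.05, seed=0):
    """Simulate two Q-learners; return (cooperation rate, q1, q2)."""
    rng = random.Random(seed)
    states = [(a, b) for a in (C, D) for b in (C, D)]
    # One table per agent: q[state][action], state = (own last, other's last).
    q1 = {s: [0.0, 0.0] for s in states}
    q2 = {s: [0.0, 0.0] for s in states}
    last1, last2 = rng.choice((C, D)), rng.choice((C, D))
    coop = 0
    for _ in range(steps):
        s1, s2 = (last1, last2), (last2, last1)
        a1 = choose(q1, s1, epsilon, rng)
        a2 = choose(q2, s2, epsilon, rng)
        r1, r2 = PAYOFF[(a1, a2)], PAYOFF[(a2, a1)]
        n1, n2 = (a1, a2), (a2, a1)  # next state seen by each agent
        # Q-learning update: move the estimate toward r + gamma * max Q(next).
        q1[s1][a1] += alpha * (r1 + gamma * max(q1[n1]) - q1[s1][a1])
        q2[s2][a2] += alpha * (r2 + gamma * max(q2[n2]) - q2[s2][a2])
        coop += (a1 == C) + (a2 == C)
        last1, last2 = a1, a2
    return coop / (2 * steps), q1, q2
```

Calling `run()` with different seeds and comparing the returned cooperation rates gives a rough feel for how much individual runs vary; the sketch makes no claim about reproducing the reported figures.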