On the Emergence of Cooperation in the Repeated Prisoner's Dilemma

Using simulations between pairs of $\epsilon$-greedy Q-learners with one-period memory, this article demonstrates that the potential function of the stochastic replicator dynamics (Foster and Young, 1990) can predict the emergence of error-proof cooperative strategies from the underlying parameters of the repeated prisoner's dilemma. The cooperation rates observed between Q-learners are related to the ratio between the kinetic energies exerted by the polar attractors of the replicator dynamics under the grim trigger strategy. The frontier separating the parameter space conducive to cooperation from the parameter space dominated by defection can be found by setting the kinetic energy ratio equal to a critical value, which is a function of the discount factor, $f(\delta) = \delta/(1-\delta)$, multiplied by a correction term that accounts for the effect of the algorithms' exploration probability. The gradient at the frontier increases with the distance between the game parameters and the hyperplane that characterizes the incentive compatibility constraint for cooperation under grim trigger. Building on literature from the neurosciences, which suggests that reinforcement learning is useful for understanding human behavior in risky environments, the article further explores the extent to which the frontier derived for Q-learners also explains the emergence of cooperation between humans. Using metadata from laboratory experiments that analyze human choices in the infinitely repeated prisoner's dilemma, the cooperation rates between humans are compared to those observed between Q-learners under similar conditions. The correlation coefficients between the cooperation rates observed for humans and those observed for Q-learners are consistently above $0.8$. The frontier derived from the simulations between Q-learners is also found to predict the emergence of cooperation between humans.
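
To make the simulation setting concrete, the following is a minimal sketch of two $\epsilon$-greedy Q-learners with one-period memory playing the repeated prisoner's dilemma. It is not the article's implementation: the payoff values, learning rate `alpha`, exploration probability `epsilon`, number of periods, initial memory, and all function names are illustrative assumptions. The helper `grim_trigger_threshold` encodes the standard textbook incentive compatibility condition for grim trigger, $\delta \geq (T-R)/(T-P)$, and `critical_value` is the $f(\delta)$ given above without the exploration correction term, whose form is not stated here.

```python
import random

# Hypothetical payoffs with T > R > P > S (assumed, not from the article).
T, R, P, S = 5.0, 3.0, 1.0, 0.0
PAYOFF = {("C", "C"): (R, R), ("C", "D"): (S, T),
          ("D", "C"): (T, S), ("D", "D"): (P, P)}
ACTIONS = ("C", "D")
# One-period memory: a state is the previous (own action, other's action).
STATES = [(own, other) for own in ACTIONS for other in ACTIONS]


def grim_trigger_threshold(t, r, p):
    """Smallest discount factor for which grim trigger sustains cooperation."""
    return (t - r) / (t - p)


def critical_value(delta):
    """f(delta) = delta / (1 - delta), before any exploration correction."""
    return delta / (1.0 - delta)


def choose(q, state, epsilon, rng):
    """Epsilon-greedy choice; ties between Q-values are broken at random."""
    if rng.random() < epsilon:
        return rng.choice(ACTIONS)
    q_c, q_d = q[state]["C"], q[state]["D"]
    if q_c == q_d:
        return rng.choice(ACTIONS)
    return "C" if q_c > q_d else "D"


def simulate(delta=0.9, alpha=0.1, epsilon=0.05, periods=20000, seed=0):
    """Return the share of cooperative actions over all periods and players."""
    rng = random.Random(seed)
    q1 = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    q2 = {s: {a: 0.0 for a in ACTIONS} for s in STATES}
    s1 = s2 = ("C", "C")  # arbitrary initial memory (assumption)
    cooperative_actions = 0
    for _ in range(periods):
        a1 = choose(q1, s1, epsilon, rng)
        a2 = choose(q2, s2, epsilon, rng)
        r1, r2 = PAYOFF[(a1, a2)]
        n1, n2 = (a1, a2), (a2, a1)  # each player sees its own action first
        # Standard Q-learning update with discount factor delta.
        q1[s1][a1] += alpha * (r1 + delta * max(q1[n1].values()) - q1[s1][a1])
        q2[s2][a2] += alpha * (r2 + delta * max(q2[n2].values()) - q2[s2][a2])
        s1, s2 = n1, n2
        cooperative_actions += (a1 == "C") + (a2 == "C")
    return cooperative_actions / (2 * periods)


if __name__ == "__main__":
    print("grim trigger threshold:", grim_trigger_threshold(T, R, P))
    print("f(0.9):", critical_value(0.9))
    print("cooperation rate:", simulate())
```

Sweeping `delta` and the payoffs across a grid and recording the output of `simulate` would give a rough empirical picture of where cooperation emerges under these assumed settings. A single run is noisy, so averaging over several seeds is advisable.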
