In dynamic environments, Q-learning is an automaton that (i) provides estimates (Q-values) of the continuation values associated with each available action, and (ii) follows the naive policy of almost always choosing the action with the highest Q-value. We consider a family of automata that are based on Q-values but whose policy may systematically favor some actions over others, for example through a bias that favors cooperation. In the spirit of Compte and Postlewaite [2018], we look for equilibrium biases within this family of Q-based automata. We examine classic games under various monitoring technologies and find that equilibrium biases may strongly foster collusion.
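To make the distinction between the naive policy and a biased one concrete, the following is a minimal sketch rather than the specification studied in the paper. It assumes a single-state (stateless) learner, a standard Q-learning update, and an additive bias on Q-values at the moment of choice; the class name `BiasedQAutomaton` and the parameters `bias`, `alpha`, `delta`, and `epsilon` are hypothetical and chosen for illustration only.

```python
import random


class BiasedQAutomaton:
    """Hypothetical Q-based automaton with an additive bias on action choice."""

    def __init__(self, actions, bias=None, alpha=0.1, delta=0.95,
                 epsilon=0.05, seed=0):
        self.actions = list(actions)
        self.bias = dict(bias or {})   # assumed form: one additive bonus per action
        self.alpha = alpha             # learning rate
        self.delta = delta             # discount factor
        self.epsilon = epsilon         # probability of experimenting
        self.q = {a: 0.0 for a in self.actions}
        self.rng = random.Random(seed)

    def preferred(self):
        # Action maximizing Q-value plus bias. With no bias this is the
        # naive (greedy) choice.
        return max(self.actions,
                   key=lambda a: self.q[a] + self.bias.get(a, 0.0))

    def choose(self):
        # "Almost always" pick the preferred action; experiment otherwise.
        if self.rng.random() < self.epsilon:
            return self.rng.choice(self.actions)
        return self.preferred()

    def update(self, action, payoff):
        # Standard Q-learning update. Assumption: the bias affects only the
        # policy, so the continuation value uses the unbiased maximum.
        target = payoff + self.delta * max(self.q.values())
        self.q[action] += self.alpha * (target - self.q[action])


# Usage: identical Q-values, different choices once a bias toward "C" is added.
naive = BiasedQAutomaton(["C", "D"], epsilon=0.0)
biased = BiasedQAutomaton(["C", "D"], bias={"C": 0.5}, epsilon=0.0)
for agent in (naive, biased):
    agent.q = {"C": 1.0, "D": 1.2}
```

In this sketch both automata hold the same estimates, yet the naive one prefers `D` while the biased one prefers `C`, because the bias of 0.5 outweighs the 0.2 gap in Q-values. Keeping the bias out of the update is a design choice made here so that Q-values remain estimates of continuation values; other formulations are possible.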