In dynamic environments, Q-learning is an adaptive rule that provides an estimate (a Q-value) of the continuation value associated with each alternative. A naive policy consists of always choosing the alternative with the highest Q-value. We consider a family of Q-based policy rules that may systematically favor some alternatives over others, for example rules that incorporate a leniency bias favoring cooperation. In the spirit of Compte and Postlewaite [2018], we look for equilibrium biases (or Qb-equilibria) within this family of Q-based rules. We examine classic games under various monitoring technologies.
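The contrast between the naive policy and a biased Q-based rule can be illustrated with a minimal sketch. It is not the paper's specification: it assumes a single-state Q-table over two alternatives ("C" for cooperate, "D" for defect), a textbook Q-learning update, and a bias modeled as an additive bonus on the Q-value. The names `q_update`, `naive_choice`, `biased_choice`, `alpha`, `delta`, and `bias` are hypothetical and chosen only for illustration.

```python
def q_update(q, action, payoff, next_value, alpha=0.1, delta=0.9):
    """Textbook Q-learning step: move Q(action) toward payoff + delta * next_value.

    alpha (learning rate) and delta (discount factor) are illustrative
    assumptions, not parameters taken from the paper.
    """
    q[action] += alpha * (payoff + delta * next_value - q[action])


def naive_choice(q):
    """Naive policy: always pick the alternative with the highest Q-value."""
    return max(q, key=q.get)


def biased_choice(q, bias):
    """Biased rule (assumed additive form): pick the alternative that
    maximizes Q-value plus a fixed bonus. A positive bonus on "C" acts
    like a leniency bias that favors cooperation."""
    return max(q, key=lambda a: q[a] + bias.get(a, 0.0))


# Usage: defection looks slightly better, but a leniency bias tips the choice.
q = {"C": 1.0, "D": 1.2}
print(naive_choice(q))                # D
print(biased_choice(q, {"C": 0.5}))   # C
```

Under this assumed form, a Qb-equilibrium search would amount to looking for bias values that are mutual best responses across players; the sketch only shows how a single player's choice changes with the bias.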