Expert imitation, behavioral diversity, and fairness preferences give rise to preferences in sequential decision-making domains that do not decompose additively across time. We introduce the class of convex Markov games, which allow general convex preferences over occupancy measures. Despite an infinite time horizon and strictly greater generality than Markov games, pure-strategy Nash equilibria exist under strict convexity. Furthermore, equilibria can be approximated efficiently by performing gradient descent on an upper bound of exploitability. Our experiments imitate human choices in ultimatum games, reveal novel solutions to the repeated prisoner's dilemma, and find fair solutions in a repeated asymmetric coordination game. In the prisoner's dilemma, our algorithm finds a policy profile that deviates only slightly from observed human play, yet achieves higher per-player utility while also being three orders of magnitude less exploitable.
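To make the notion of a preference over occupancy measures concrete, the following is a minimal single-agent sketch, not the algorithm from this work. It computes the discounted state-action occupancy measure of a fixed policy in a toy two-state model and evaluates two utilities on it: a standard reward that is linear in the occupancy measure, and an entropy-regularized variant that is concave in the occupancy measure (equivalently, a convex loss) and so does not decompose additively across time. The transition model, policy, reward values, and all function names are assumptions chosen purely for illustration.

```python
import math


def occupancy_measure(P, policy, rho0, gamma, iters=2000):
    """Discounted state-action occupancy mu[s][a] of a fixed policy.

    P[s][a][s2]  : transition probability (hypothetical toy model)
    policy[s][a] : probability of taking action a in state s
    rho0[s]      : initial state distribution
    """
    n_s, n_a = len(P), len(P[0])
    d = list(rho0)  # state occupancy, refined by fixed-point iteration
    for _ in range(iters):
        # d <- (1 - gamma) * rho0 + gamma * (P_pi)^T d
        nxt = [(1.0 - gamma) * rho0[s2] for s2 in range(n_s)]
        for s in range(n_s):
            for a in range(n_a):
                w = gamma * d[s] * policy[s][a]
                for s2 in range(n_s):
                    nxt[s2] += w * P[s][a][s2]
        d = nxt
    return [[d[s] * policy[s][a] for a in range(n_a)] for s in range(n_s)]


def linear_utility(mu, r):
    """Standard expected discounted reward: linear in mu."""
    return sum(m * x for mu_s, r_s in zip(mu, r) for m, x in zip(mu_s, r_s))


def entropy_regularized_utility(mu, r, tau):
    """Linear reward plus tau times the Shannon entropy of mu.

    Concave in mu, so it is not a sum of per-step rewards.
    """
    entropy = -sum(m * math.log(m) for mu_s in mu for m in mu_s if m > 0.0)
    return linear_utility(mu, r) + tau * entropy


# Toy model (assumed): action 0 stays in place, action 1 switches state.
P = [[[1.0, 0.0], [0.0, 1.0]],
     [[0.0, 1.0], [1.0, 0.0]]]
policy = [[0.5, 0.5], [0.5, 0.5]]   # uniform random policy
rho0 = [1.0, 0.0]                   # always start in state 0
reward = [[1.0, 0.0], [0.0, 2.0]]   # illustrative reward table

mu = occupancy_measure(P, policy, rho0, gamma=0.9)
print(linear_utility(mu, reward))
print(entropy_regularized_utility(mu, reward, tau=0.1))
```

With `tau = 0` the second utility reduces to the first, which mirrors how an ordinary Markov game is the special case with linear preferences over occupancy measures.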