We investigate Nash equilibrium learning in a competitive Markov Game (MG) environment in which multiple agents compete and multiple Nash equilibria can exist. In oligopolistic dynamic pricing environments in particular, exact Nash equilibria are difficult to obtain because of the curse of dimensionality. We develop a new model-free method to find approximate Nash equilibria. Gradient-free black-box optimization is applied to estimate $\epsilon$, the maximum reward advantage an agent can gain by unilaterally deviating from a given joint policy, and to estimate the $\epsilon$-minimizing policy for any given state. The policy-to-$\epsilon$ correspondence and the mapping from state to $\epsilon$-minimizing policy are each represented by a neural network; the latter is the Nash Policy Net. During batch updates, we perform Nash Q-learning on the system, adjusting the action probabilities using the Nash Policy Net. We demonstrate that an approximate Nash equilibrium can be learned, particularly in the dynamic pricing domain, where exact solutions are often intractable.
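To make the quantity $\epsilon$ concrete, the following is a minimal sketch for a single-state, two-player game, where $\epsilon$ can be computed exactly by enumerating each player's pure deviations. It is not the method described above, which estimates $\epsilon$ with gradient-free black-box optimization and neural networks. The payoff matrices, the function names `expected_reward` and `epsilon`, and the two-price setup are all hypothetical and chosen only for illustration.

```python
def expected_reward(payoff, pi_row, pi_col):
    """Expected payoff when the row and column players mix independently."""
    return sum(
        pi_row[i] * pi_col[j] * payoff[i][j]
        for i in range(len(pi_row))
        for j in range(len(pi_col))
    )


def epsilon(payoff_row, payoff_col, pi_row, pi_col):
    """Largest gain any one player can obtain by deviating unilaterally."""
    n_row, n_col = len(pi_row), len(pi_col)
    # Reward of each player under the joint policy (pi_row, pi_col).
    value_row = expected_reward(payoff_row, pi_row, pi_col)
    value_col = expected_reward(payoff_col, pi_row, pi_col)
    # Best pure-action response of each player while the opponent stays fixed.
    # A pure action always attains the best-response value in a matrix game.
    best_row = max(
        sum(pi_col[j] * payoff_row[i][j] for j in range(n_col))
        for i in range(n_row)
    )
    best_col = max(
        sum(pi_row[i] * payoff_col[i][j] for i in range(n_row))
        for j in range(n_col)
    )
    return max(best_row - value_row, best_col - value_col)


# Hypothetical duopoly: action 0 = high price, action 1 = low price.
# Undercutting a high-priced rival pays 5; matching prices pays 3 (high) or 1 (low).
payoff_row = [[3.0, 0.0], [5.0, 1.0]]
payoff_col = [[3.0, 5.0], [0.0, 1.0]]

eps_both_low = epsilon(payoff_row, payoff_col, [0.0, 1.0], [0.0, 1.0])
eps_both_high = epsilon(payoff_row, payoff_col, [1.0, 0.0], [1.0, 0.0])
print(eps_both_low, eps_both_high)
```

In this toy game the joint policy in which both firms price low has $\epsilon = 0$, so it is an exact Nash equilibrium, while the joint policy in which both price high has $\epsilon = 2$, because either firm gains 2 by undercutting. Exhaustive enumeration of this kind stops being feasible as the state and action spaces grow, which is the setting the approximate method targets.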