Effective action abstraction is crucial for handling the large action spaces of Imperfect Information Extensive-Form Games (IIEFGs). However, because of the vast state space and computational complexity of IIEFGs, existing methods often rely on fixed abstractions, which leads to sub-optimal performance. In response, we introduce RL-CFR, a novel reinforcement learning (RL) approach for dynamic action abstraction. RL-CFR builds on our Markov Decision Process (MDP) formulation, in which states correspond to public information and actions are feature vectors that each specify an action abstraction. The reward is defined as the difference in expected payoff between the selected action abstraction and the default action abstraction. RL-CFR constructs a game tree with RL-guided action abstractions and uses counterfactual regret minimization (CFR) to derive strategies. It can be trained from scratch and achieves a higher expected payoff without increasing CFR solving time. In experiments on Heads-up No-limit Texas Hold'em, RL-CFR outperforms a replication of ReBeL and Slumbot by significant win-rate margins of $64\pm 11$ and $84\pm 17$ mbb/hand (milli-big-blinds per hand), respectively.
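As a minimal sketch of the reward in the MDP formulation described above, the following Python fragment computes the expected-payoff difference between a selected and a default action abstraction at one public state. Every name here (`PublicState`, `Abstraction`, `Transition`, `abstraction_reward`, `toy_solve_ev`) is hypothetical and chosen for illustration only; in particular, `toy_solve_ev` is a stand-in scoring function, not a CFR solver, and the feature encodings are assumptions rather than the ones used by RL-CFR.

```python
from dataclasses import dataclass
from typing import Callable, Tuple

# Hypothetical encodings, assumed for illustration only.
PublicState = Tuple[float, ...]   # features of the public information (here: just the pot size)
Abstraction = Tuple[float, ...]   # feature vector naming an action abstraction (here: bet sizes as pot fractions)
Solver = Callable[[PublicState, Abstraction], float]


@dataclass(frozen=True)
class Transition:
    """One step of the abstraction-selection MDP: state, chosen action, reward."""
    state: PublicState
    action: Abstraction
    reward: float


def abstraction_reward(solve_ev: Solver, state: PublicState,
                       selected: Abstraction, default: Abstraction) -> float:
    """Expected payoff under the selected abstraction minus that under the default one."""
    return solve_ev(state, selected) - solve_ev(state, default)


def toy_solve_ev(state: PublicState, abstraction: Abstraction) -> float:
    """Stand-in for a solver (NOT CFR): penalizes abstractions that lack a
    bet size close to an assumed ideal of 0.75 pot, scaled by the pot size."""
    pot = state[0]
    return -min(abs(bet - 0.75) for bet in abstraction) * pot


state: PublicState = (100.0,)
default: Abstraction = (0.5, 1.0)
selected: Abstraction = (0.5, 0.75, 1.0)

step = Transition(state, selected,
                  abstraction_reward(toy_solve_ev, state, selected, default))
```

Under this toy scoring rule, choosing the default abstraction itself yields a reward of zero, so the reward measures only the gain attributable to the chosen abstraction; in a real setting the `solve_ev` callable would be replaced by an actual game solver.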