Much of the advancement of Multi-Agent Reinforcement Learning (MARL) in imperfect-information games has historically depended on manual, iterative refinement of baselines. While foundational families such as Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) rest on solid theoretical ground, the design of their most effective variants often relies on human intuition to navigate a vast algorithmic design space. In this work, we propose the use of AlphaEvolve, an evolutionary coding agent powered by large language models, to automatically discover new multi-agent learning algorithms. We demonstrate the generality of this framework by evolving novel variants for two distinct paradigms of game-theoretic learning. First, in the domain of iterative regret minimization, we evolve the logic governing regret accumulation and policy derivation, discovering a new algorithm, Volatility-Adaptive Discounted CFR (VAD-CFR). VAD-CFR employs novel, non-intuitive mechanisms, including volatility-sensitive discounting, consistency-enforced optimism, and a hard warm-start policy accumulation schedule, to outperform state-of-the-art baselines such as Discounted Predictive CFR+. Second, in the regime of population-based training algorithms, we evolve training-time and evaluation-time meta-strategy solvers for PSRO, discovering a new variant, Smoothed Hybrid Optimistic Regret PSRO (SHOR-PSRO). SHOR-PSRO introduces a hybrid meta-solver that linearly blends Optimistic Regret Matching with a smoothed, temperature-controlled distribution over the best pure strategies. By dynamically annealing this blending factor and diversity bonuses during training, the algorithm automates the transition from population diversity to rigorous equilibrium finding, yielding superior empirical convergence compared to standard static meta-solvers.
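To make the first contribution concrete: the abstract names VAD-CFR's mechanisms but not their formulas, so the following is a minimal illustrative sketch of what a volatility-sensitive discounted regret update at a single information set might look like. It assumes a DCFR-style positive discount damped by a simple volatility proxy; the volatility measure, the schedule, and the parameters `alpha` and `kappa` are hypothetical, not the paper's exact rules.

```python
# Hypothetical sketch of a volatility-adaptive discounted regret update.
# The abstract only names the mechanism; the volatility proxy, the
# discount schedule, and all parameter names below are assumptions.
import numpy as np

def vad_regret_update(cum_regret, inst_regret, prev_inst_regret,
                      t, alpha=1.5, kappa=0.5):
    """One regret-accumulation step at a single information set.

    cum_regret: accumulated regrets per action, shape (num_actions,)
    inst_regret: instantaneous counterfactual regrets this iteration
    prev_inst_regret: instantaneous regrets from the previous iteration
    t: iteration counter (>= 1)
    alpha: base discounting exponent, as in Discounted CFR
    kappa: assumed sensitivity of the discount to regret volatility
    """
    # Assumed volatility proxy: how much the instantaneous regrets moved.
    volatility = np.abs(inst_regret - prev_inst_regret).mean()
    # DCFR-style discount on accumulated regret, damped when regrets
    # are volatile (an illustrative reading of "volatility-sensitive").
    base = t**alpha / (t**alpha + 1.0)
    discount = base / (1.0 + kappa * volatility)
    return discount * cum_regret + inst_regret

def regret_matching(cum_regret):
    """Derive the current policy from accumulated positive regrets."""
    pos = np.maximum(cum_regret, 0.0)
    total = pos.sum()
    if total > 0.0:
        return pos / total
    return np.full_like(cum_regret, 1.0 / len(cum_regret))
```

The design intuition sketched here is that aggressive discounting is safe when regrets are stable but should be tempered when they swing, so that noisy iterations do not erase useful accumulated signal; whether VAD-CFR realizes this with the same functional form is not specified in the abstract.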
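Similarly, the SHOR-PSRO meta-solver can be illustrated with a short sketch. The linear blend of Optimistic Regret Matching with a temperature-smoothed distribution over best pure strategies is taken from the abstract; the linear annealing schedule, the payoff inputs, and all names below are illustrative assumptions rather than the published solver.

```python
# Hypothetical sketch of the hybrid meta-solver described for SHOR-PSRO.
# The blend direction (diverse early, equilibrium-focused late) follows
# the abstract; the annealing schedule and parameters are assumptions.
import numpy as np

def optimistic_regret_matching(cum_regret, last_inst_regret):
    """Regret matching with an optimistic (predictive) regret term."""
    pos = np.maximum(cum_regret + last_inst_regret, 0.0)
    total = pos.sum()
    if total > 0.0:
        return pos / total
    return np.full(len(cum_regret), 1.0 / len(cum_regret))

def smoothed_best_strategies(payoffs, temperature):
    """Temperature-controlled softmax over pure-strategy payoffs,
    concentrating on the best strategies as temperature -> 0."""
    z = payoffs / max(temperature, 1e-8)
    z = z - z.max()  # numerical stability
    e = np.exp(z)
    return e / e.sum()

def shor_meta_solver(cum_regret, last_inst_regret, payoffs,
                     t, t_max, temperature=0.1):
    """Linearly blend ORM with the smoothed best-strategy distribution,
    annealing the blend weight over PSRO iterations t = 0..t_max."""
    lam = 1.0 - t / t_max  # assumed linear annealing of the blend factor
    orm = optimistic_regret_matching(cum_regret, last_inst_regret)
    smooth = smoothed_best_strategies(payoffs, temperature)
    return lam * smooth + (1.0 - lam) * orm
```

In this reading, early iterations weight the smoothed best-strategy distribution, encouraging a diverse population of responses, while later iterations hand control to Optimistic Regret Matching for equilibrium refinement; the abstract additionally mentions annealed diversity bonuses, which this sketch omits.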