Much of the advancement of Multi-Agent Reinforcement Learning (MARL) in imperfect-information games has historically depended on manual, iterative refinement of baselines. While foundational families like Counterfactual Regret Minimization (CFR) and Policy Space Response Oracles (PSRO) rest on solid theoretical ground, the design of their most effective variants often relies on human intuition to navigate a vast algorithmic design space. In this work, we propose the use of AlphaEvolve, an evolutionary coding agent powered by large language models, to automatically discover new multi-agent learning algorithms. We demonstrate the generality of this framework by evolving novel variants for two distinct paradigms of game-theoretic learning. First, in the domain of iterative regret minimization, we evolve the logic governing regret accumulation and policy derivation, discovering a new algorithm, Volatility-Adaptive Discounted (VAD-)CFR. VAD-CFR employs novel, non-intuitive mechanisms (volatility-sensitive discounting, consistency-enforced optimism, and a hard warm-start policy accumulation schedule) to outperform state-of-the-art baselines like Discounted Predictive CFR+. Second, in the regime of population-based training algorithms, we evolve training-time and evaluation-time meta-strategy solvers for PSRO, discovering a new variant, Smoothed Hybrid Optimistic Regret (SHOR-)PSRO. SHOR-PSRO introduces a hybrid meta-solver that linearly blends Optimistic Regret Matching with a smoothed, temperature-controlled distribution over best pure strategies. By dynamically annealing this blending factor and diversity bonuses during training, the algorithm automates the transition from population diversity to rigorous equilibrium finding, yielding superior empirical convergence compared to standard static meta-solvers.
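To make the hybrid meta-solver idea concrete, the following is a minimal sketch of a linear blend between an optimistic regret-matching distribution and a temperature-controlled softmax over pure-strategy values, with a linearly annealed blending factor. The abstract does not give the exact update rules, so every function name, the optimistic prediction (cumulative regret plus the most recent instantaneous regret), the softmax smoothing, and the linear annealing schedule are illustrative assumptions rather than the evolved SHOR-PSRO code; the diversity bonus is omitted.

```python
import numpy as np


def regret_matching_policy(regrets):
    """Standard regret matching: play in proportion to positive regret."""
    positive = np.maximum(regrets, 0.0)
    total = positive.sum()
    if total > 0.0:
        return positive / total
    # No positive regret: fall back to the uniform distribution.
    return np.full(len(regrets), 1.0 / len(regrets))


def optimistic_rm_policy(cum_regrets, last_regrets):
    """Assumed optimistic variant: use the last instantaneous regret
    as a prediction of the next one before applying regret matching."""
    return regret_matching_policy(cum_regrets + last_regrets)


def smoothed_best_response(values, temperature):
    """Softmax over pure-strategy values; a low temperature concentrates
    mass on the best pure strategy (assumed form of the smoothing)."""
    z = (values - values.max()) / temperature  # shift for numerical stability
    weights = np.exp(z)
    return weights / weights.sum()


def hybrid_meta_strategy(cum_regrets, last_regrets, values, lam, temperature):
    """Linear blend: (1 - lam) * optimistic RM + lam * smoothed best response."""
    orm = optimistic_rm_policy(cum_regrets, last_regrets)
    smooth = smoothed_best_response(values, temperature)
    return (1.0 - lam) * orm + lam * smooth


def annealed(start, end, step, total_steps):
    """Hypothetical linear schedule from `start` (step 0) to `end` (last step)."""
    frac = min(max(step / max(total_steps - 1, 1), 0.0), 1.0)
    return start + (end - start) * frac
```

In a PSRO loop, one might call `annealed` once per iteration to obtain `lam` and pass the result to `hybrid_meta_strategy`. Because both components are probability distributions and `lam` lies in [0, 1], the blend is itself a valid distribution over the population.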