Provably efficient and robust equilibrium computation in general-sum Markov games remains a core challenge in multi-agent reinforcement learning. Nash equilibrium is computationally intractable in general and brittle in practice, owing to equilibrium multiplicity and sensitivity to approximation error. We study the Risk-Sensitive Quantal Response Equilibrium (RQRE), which yields a unique, smooth solution under bounded rationality and risk sensitivity. We propose \texttt{RQRE-OVI}, an optimistic value iteration algorithm that computes RQRE with linear function approximation in large or continuous state spaces. Through a finite-sample regret analysis, we establish convergence and explicitly characterize how sample complexity scales with the rationality and risk-sensitivity parameters. The regret bounds reveal a quantitative tradeoff: increasing rationality tightens regret, while risk sensitivity induces regularization that enhances stability and robustness. This exposes a Pareto frontier between expected performance and robustness, with Nash equilibrium recovered in the limit of perfect rationality and risk neutrality. We further show that, unlike the Nash correspondence, the RQRE policy map is Lipschitz continuous in the estimated payoffs, and that RQRE admits a distributionally robust optimization interpretation. Empirically, \texttt{RQRE-OVI} achieves competitive performance under self-play while producing substantially more robust behavior under cross-play than Nash-based approaches. These results suggest that \texttt{RQRE-OVI} offers a principled, scalable, and tunable path toward equilibrium learning with improved robustness and generalization.
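For concreteness, the following is a minimal sketch of one plausible instantiation of the RQRE best response, combining a logit quantal response with the entropic risk measure; the notation $\lambda$ (rationality), $\tau$ (risk sensitivity), and $Q_i$ is illustrative and need not match the paper's exact definitions:
\begin{align*}
  \rho_\tau[X] &= -\frac{1}{\tau}\log \mathbb{E}\!\left[e^{-\tau X}\right]
  && \text{(entropic risk; } \rho_\tau[X] \to \mathbb{E}[X] \text{ as } \tau \to 0\text{)} \\
  \pi_i(a_i \mid s) &\propto \exp\!\Big(\lambda\,\rho_\tau\big[Q_i(s, a_i, a_{-i})\big]\Big)
  && \text{(logit quantal response)}
\end{align*}
where the expectation inside $\rho_\tau$ is taken over the opponents' randomized play $a_{-i} \sim \pi_{-i}(\cdot \mid s)$ and any environment noise. Under this sketch, the limit $\lambda \to \infty$, $\tau \to 0$ turns the softmax into an exact best response to expected payoffs, consistent with the recovery of Nash equilibrium stated above, while finite $\lambda$ keeps the policy map smooth (and hence Lipschitz in the estimated payoffs).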