Centralized value learning is often assumed to improve coordination and stability in multi-agent reinforcement learning, yet this assumption is rarely tested under controlled conditions. We evaluate it directly in a fully tabular predator-prey gridworld by comparing independent and centralized Q-learning under explicit embodiment constraints on agent speed and stamina. Across multiple kinematic regimes and asymmetric agent roles, centralized learning fails to provide a consistent advantage and is frequently outperformed by fully independent learning, even under full observability and exact value estimation. Moreover, asymmetric centralized-independent configurations induce persistent coordination breakdowns rather than transient learning instability. By eliminating confounding effects from function approximation and representation learning, our tabular analysis isolates coordination structure as the primary driver of these effects. The results show that increased coordination can become a liability under embodiment constraints, and that the effectiveness of centralized learning is fundamentally regime- and role-dependent rather than universal.
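To make the comparison concrete, the following minimal sketch contrasts the two tabular update rules the abstract refers to. The class names, hyperparameters (`ALPHA`, `GAMMA`), and state encodings are illustrative assumptions, not the paper's exact implementation; the key structural difference is that independent learners maintain per-agent Q-tables over their own observations and actions, while the centralized learner maintains one Q-table over the joint state and joint action.

```python
# Sketch of the two value-learning schemes compared in the paper.
# Gridworld, state encoding, and hyperparameters are assumed, not given.
import numpy as np
from collections import defaultdict

ALPHA, GAMMA = 0.1, 0.95  # assumed learning rate and discount factor


class IndependentQ:
    """Each agent keeps its own Q-table over its own observation and action."""

    def __init__(self, n_actions):
        self.q = defaultdict(lambda: np.zeros(n_actions))

    def update(self, obs, action, reward, next_obs):
        # Standard per-agent Q-learning backup; the other agents are
        # folded into the (nonstationary) environment dynamics.
        target = reward + GAMMA * self.q[next_obs].max()
        self.q[obs][action] += ALPHA * (target - self.q[obs][action])


class CentralizedQ:
    """One Q-table over the joint state and joint action of all agents."""

    def __init__(self, n_agents, n_actions):
        # The joint action space grows exponentially: n_actions ** n_agents.
        self.q = defaultdict(lambda: np.zeros(n_actions ** n_agents))

    def update(self, joint_state, joint_action_idx, reward, next_joint_state):
        # Exact tabular backup over the joint action, assuming full
        # observability as in the paper's setting.
        target = reward + GAMMA * self.q[next_joint_state].max()
        self.q[joint_state][joint_action_idx] += ALPHA * (
            target - self.q[joint_state][joint_action_idx]
        )
```

Because both updates are exact and tabular, any performance gap between the two schemes reflects the coordination structure itself rather than approximation error, which is the isolation the abstract claims.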