Effective training-time guidance is central to multi-agent reinforcement learning (MARL), but it remains difficult to provide in sparse-reward settings, where weak supervision limits coordination and policy improvement and existing methods often require substantial domain expertise or manual design effort. Large language models (LLMs) offer a promising alternative for flexible learning-signal design; however, existing LLM-based methods are largely single-agent-oriented, one-shot, or weakly validated against the evolving training dynamics of cooperative MARL. To address these limitations, we propose LLM-ALSO, an iterative, LLM-driven adaptive learning-signal optimization framework for MARL. Rather than directly deploying LLM-generated rewards, LLM-ALSO decomposes adaptation into iterative diagnosis, proposal, and validation: a Critic LLM diagnoses stage-specific learning and coordination failures from sparse-return metrics and compact behavior evidence; a Generator LLM proposes candidate reward-shaping configurations conditioned on that diagnosis; and branch-validation feedback refines the candidates before they affect the main training trajectory. By combining short-horizon validation with stage-aware adaptation, LLM-ALSO promotes only validated updates into training, reducing the risk posed by unreliable LLM-generated modifications. Experiments on sparse-reward cooperative MARL tasks show that LLM-ALSO improves sparse-evaluation performance and learning efficiency.
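The diagnose, propose, validate, and promote cycle described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: every name here (`adapt_stage`, `ShapingConfig`, `StageResult`, the `toy_*` stand-ins, the `proximity_bonus` term, and the `rounds` parameter) is a hypothetical placeholder. The Critic LLM, the Generator LLM, and short-horizon branch training are replaced by plain callables, and "validated" is assumed to mean a strictly higher sparse return than the current configuration on a validation branch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical representation: weights of reward-shaping terms.
ShapingConfig = Dict[str, float]


@dataclass
class StageResult:
    config: ShapingConfig   # configuration carried into main training
    sparse_return: float    # sparse return measured on the validation branch
    promoted: bool          # True if a candidate replaced the current config


def adapt_stage(
    current: ShapingConfig,
    diagnose: Callable[[ShapingConfig], str],
    propose: Callable[[str, List[str]], List[ShapingConfig]],
    validate: Callable[[ShapingConfig], float],
    rounds: int = 2,
) -> StageResult:
    """One adaptation stage: diagnose, propose, branch-validate, promote."""
    best_config, best_return = current, validate(current)
    diagnosis = diagnose(current)      # stands in for the Critic LLM
    feedback: List[str] = []           # branch-validation feedback so far
    for _ in range(rounds):
        # Stands in for the Generator LLM, conditioned on diagnosis + feedback.
        for candidate in propose(diagnosis, feedback):
            score = validate(candidate)  # short-horizon branch evaluation
            feedback.append(f"{candidate} -> {score:.3f}")
            # Assumption: promote only on strict improvement in sparse return.
            if score > best_return:
                best_config, best_return = candidate, score
    return StageResult(best_config, best_return, best_config is not current)


# Toy stand-ins so the sketch runs without any LLM or MARL environment.
def toy_diagnose(config: ShapingConfig) -> str:
    return "agents rarely reach the goal together"


def toy_propose(diagnosis: str, feedback: List[str]) -> List[ShapingConfig]:
    step = 0.1 * (len(feedback) + 1)   # widen the search as feedback accrues
    return [{"proximity_bonus": step}, {"proximity_bonus": -step}]


def toy_validate(config: ShapingConfig) -> float:
    # Pretend the sparse return peaks when the bonus weight is 0.3.
    return 1.0 - abs(config.get("proximity_bonus", 0.0) - 0.3)


if __name__ == "__main__":
    result = adapt_stage({"proximity_bonus": 0.0},
                         toy_diagnose, toy_propose, toy_validate)
    print(result)
```

In this sketch the main-training configuration changes only when a candidate beats the current one on the validation branch; otherwise the existing configuration is kept, which mirrors the abstract's point that unvalidated LLM proposals never reach the main training trajectory.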