Multi-agent reinforcement learning (MARL) in real-world use cases may need to adapt to external natural-language instructions that interrupt ongoing behavior and conflict with long-horizon objectives. However, conditioning rewards on instructions introduces a fundamental failure mode: Bellman updates couple value estimates across instruction contexts, producing inconsistent values when instructions interrupt macro-actions. We propose Macro-Action Value Correction for Instruction Compliance (MAVIC), which corrects Bellman backups at instruction boundaries by correcting for the incoming instruction objective and restoring the continuation value under the current objective. Unlike reward shaping, MAVIC modifies the bootstrapping target itself, enabling consistent value estimation under stochastic instruction switching within a unified policy. We provide a theoretical analysis and an actor-critic implementation, and show that MAVIC achieves high instruction compliance while preserving base-task performance in increasingly complex cooperative multi-agent environments.
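To make the failure mode concrete, the following is a minimal tabular sketch and not the MAVIC update itself, which the abstract does not specify. It assumes a hypothetical value table `V` keyed by `(state, instruction)` and contrasts a naive bootstrap target, which reads the value under whichever instruction is active in the next state, with a boundary-aware target that bootstraps from the continuation value under the current objective. The function names, the table layout, and the numbers are all illustrative assumptions.

```python
# Hypothetical illustration only: names, table layout, and numbers are
# assumptions, not the update rule from the paper.

def naive_target(reward, gamma, V, next_state, next_instr):
    """Bootstrap from the value under the instruction active in the next state.

    When the instruction switches mid-episode, this mixes value estimates
    from two different objectives into one backup.
    """
    return reward + gamma * V[(next_state, next_instr)]


def boundary_aware_target(reward, gamma, V, next_state, cur_instr):
    """Bootstrap from the continuation value under the current objective.

    The next state's value is looked up under the instruction that was
    active when the transition started, so a switch at the boundary does
    not leak the other objective's value into this backup.
    """
    return reward + gamma * V[(next_state, cur_instr)]


# Toy value table: the same state is valued very differently per objective.
V = {("s1", "base"): 10.0, ("s1", "instr"): -4.0}

# A transition taken under the "base" objective, interrupted by "instr".
naive = naive_target(1.0, 0.9, V, "s1", "instr")
aware = boundary_aware_target(1.0, 0.9, V, "s1", "base")
print(naive, aware)
```

In this toy case the naive target is pulled toward the interrupting instruction's value, while the boundary-aware target stays consistent with the objective under which the transition was taken. How MAVIC additionally handles the incoming instruction objective is left to the paper's analysis.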