Reinforcement learning (RL) is a dominant paradigm for training autonomous agents, yet these agents often exhibit poor generalization, failing to adapt to scenarios not seen during training. In this work, we identify a fundamental cause of this brittleness, a phenomenon we term "gradient coupling." We hypothesize that in complex agentic tasks, the high similarity between distinct states leads to destructive interference between gradients: a gradient update that reinforces an optimal action in one state can inadvertently increase the likelihood of a suboptimal action in a similar, yet different, state. To address this, we propose a novel objective in which the actor is simultaneously trained as a classifier that separates good actions from bad ones. This auxiliary pressure compels the model to learn disentangled embeddings for positive and negative actions, which mitigates negative gradient interference and improves generalization. Extensive experiments demonstrate the effectiveness of our method.
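As a minimal sketch of one plausible instantiation of this combined objective (the symbols $f_\theta$, $y$, and $\lambda$ are notation introduced here for illustration and are not fixed by the abstract): let $f_\theta(s,a)$ be a scalar score computed from the actor's shared representation of state $s$ and action $a$, and let $y \in \{0,1\}$ label whether $a$ is a good action in $s$. The auxiliary classification pressure can then be added to the standard RL objective as
$$\mathcal{L}(\theta) \;=\; \mathcal{L}_{\mathrm{RL}}(\theta) \;+\; \lambda\,\mathbb{E}_{(s,a,y)}\Big[-\,y\log\sigma\big(f_\theta(s,a)\big)\;-\;(1-y)\log\big(1-\sigma\big(f_\theta(s,a)\big)\big)\Big],$$
where $\sigma$ is the logistic function and $\lambda$ weights the auxiliary term; because both losses backpropagate through the same representation, the classification term pushes embeddings of positive and negative actions apart, which is the disentangling effect described above.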