Deep reinforcement learning (DRL) agents excel at solving complex decision-making tasks across various domains. However, they often require a substantial number of training steps and a vast experience replay buffer, leading to significant computational and resource demands. To address these challenges, we introduce a novel theoretical result that integrates the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that focus on bounding the counterfactual loss, we establish a causal bound on the factual loss, which is analogous to the on-policy loss in DRL. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded. In extensive experiments across the Atari 2600 and MuJoCo domains, various agents, such as DQN and SAC, achieve up to 383% higher reward ratio than the same agents without our proposed term, while reducing the experience replay buffer size by up to 96%, significantly improving sample efficiency at negligible cost.
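To make the storage-and-reuse idea concrete, the sketch below shows a minimal replay buffer that additionally records the value network's output at insertion time, together with a DQN-style loss that reuses those stored predictions as an auxiliary penalty. This is only an illustration under assumptions: the class and function names (`ValueAugmentedReplayBuffer`, `dqn_loss_with_factual_term`), the coefficient `beta`, and the simple squared-error form of the auxiliary term are hypothetical and may differ from the causal bound derived in the paper.

```python
import random
from collections import deque

import torch
import torch.nn as nn


class ValueAugmentedReplayBuffer:
    """Replay buffer that also stores the value network's output at insertion
    time, so that otherwise-discarded predictions can be reused later."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done, past_value):
        # past_value: the value estimate for (state, action) at collection
        # time, stored as a plain float rather than discarded.
        self.buffer.append((state, action, reward, next_state, done, past_value))

    def sample(self, batch_size):
        batch = random.sample(self.buffer, batch_size)
        states, actions, rewards, next_states, dones, past_values = zip(*batch)
        return (
            torch.stack(states),
            torch.tensor(actions),
            torch.tensor(rewards, dtype=torch.float32),
            torch.stack(next_states),
            torch.tensor(dones, dtype=torch.float32),
            torch.tensor(past_values, dtype=torch.float32),
        )


def dqn_loss_with_factual_term(q_net, target_net, batch, gamma=0.99, beta=0.1):
    """Standard DQN TD loss plus a hypothetical auxiliary term that anchors
    current Q-values to the stored past predictions. `beta` is an assumed
    trade-off coefficient, not a value taken from the paper."""
    states, actions, rewards, next_states, dones, past_values = batch

    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
    with torch.no_grad():
        next_q = target_net(next_states).max(dim=1).values
        td_target = rewards + gamma * (1.0 - dones) * next_q

    td_loss = nn.functional.smooth_l1_loss(q_values, td_target)
    # Auxiliary term reusing the stored outputs; the actual bound-based term
    # in the paper may take a different functional form.
    factual_term = nn.functional.mse_loss(q_values, past_values)
    return td_loss + beta * factual_term
```

The only change to a standard DQN training loop under this sketch is passing the detached value estimate into `push` at collection time and adding the extra term at update time, which is consistent with the claim that the method reuses data at negligible cost.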