Soft Actor-Critic (SAC) depends critically on its critic network, which typically evaluates a single state-action pair to guide policy updates. Using N-step returns is a common way to reduce the bias in the critic's target values; however, N-step returns can introduce high variance and necessitate importance sampling under off-policy training, often destabilizing learning. Recent algorithms have also explored action chunking, such as direct action repetition and movement primitives, to enhance exploration. In this paper, we propose a Transformer-based critic network for SAC that integrates the N-step returns framework in a stable and efficient manner. Unlike approaches that perform chunking in the actor network, we feed chunked action sequences into the critic network to explore potential performance gains. Our architecture leverages the Transformer's ability to process sequential information, enabling more robust value estimation. Empirical results show that this method not only achieves efficient, stable training but also excels in sparse-reward and multi-phase environments, which are traditionally challenging for step-based methods. These findings underscore the promise of combining Transformer-based critics with N-step returns to advance reinforcement learning performance.
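As a point of reference for the N-step returns framework the abstract builds on, the following is a minimal sketch of the standard N-step bootstrapped target, G = r_0 + γr_1 + … + γ^{N-1}r_{N-1} + γ^N V(s_N). The function name and signature are illustrative, not taken from the paper's implementation.

```python
def n_step_target(rewards, bootstrap_value, gamma=0.99):
    """Standard N-step bootstrapped return (illustrative sketch).

    rewards: list of the N rewards [r_0, ..., r_{N-1}] along the chunk
    bootstrap_value: critic's value estimate V(s_N) at the chunk's end
    gamma: discount factor

    Computes G = r_0 + gamma*r_1 + ... + gamma^{N-1}*r_{N-1}
               + gamma^N * bootstrap_value
    by folding backwards over the reward sequence.
    """
    g = bootstrap_value
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Example: two rewards of 1.0, no bootstrap, gamma = 0.5
# G = 1.0 + 0.5 * 1.0 = 1.5
print(n_step_target([1.0, 1.0], 0.0, gamma=0.5))
```

In SAC specifically, the bootstrap term is typically the entropy-augmented value (the target Q minus the scaled log-probability of the sampled next action) rather than a plain V(s_N); the fold itself is unchanged.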