Soft Actor-Critic (SAC) depends critically on its critic network, which typically evaluates a single state-action pair to guide policy updates. N-step returns are commonly used to reduce bias in the critic's target values, but they can reintroduce high variance and require importance sampling, often destabilizing training. Recent algorithms have also explored action chunking, such as direct action repetition and movement primitives, to enhance exploration. In this paper, we propose a Transformer-based critic network for SAC that integrates the N-step returns framework in a stable and efficient manner. Unlike approaches that perform chunking in the actor network, we feed chunked action sequences into the critic network to explore potential performance gains. Our architecture leverages the Transformer's ability to process sequential information, enabling more robust value estimation. Empirical results show that this method not only achieves efficient, stable training but also excels in sparse-reward, multi-phase environments, which are traditionally challenging for step-based methods. These findings underscore the promise of combining Transformer-based critics with N-step returns to advance reinforcement learning performance.
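The N-step return target referred to above can be sketched as follows. This is a generic illustration of the standard N-step bootstrapped target, not the paper's implementation; the function name and interface are assumptions for exposition.

```python
def n_step_target(rewards, bootstrap_value, gamma):
    """Compute the N-step bootstrapped return
        G_t = sum_{k=0}^{N-1} gamma^k * r_{t+k} + gamma^N * V(s_{t+N}),
    where N = len(rewards) and bootstrap_value approximates V(s_{t+N})
    (in SAC, this would be the target critic's entropy-augmented estimate).
    """
    g = bootstrap_value
    for r in reversed(rewards):  # fold rewards back-to-front: g <- r + gamma * g
        g = r + gamma * g
    return g

# Example: 3-step target with gamma = 0.5 and a bootstrap estimate of 10
print(n_step_target([1.0, 1.0, 1.0], 10.0, 0.5))  # -> 3.0
```

Larger N propagates reward information faster (lower bias in the target) at the cost of higher variance, which is the trade-off the abstract describes.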