In this work, we build upon TD7, a reinforcement learning algorithm that incorporates State-Action Learned Embeddings (SALE) and loss-adjusted prioritized experience replay (LAP), and propose a model-free actor-critic algorithm for the offline setting that integrates ensemble Q-networks with the gradient diversity penalty from EDAC. The ensemble Q-networks address out-of-distribution actions by penalizing their overestimated values, guiding the actor network toward in-distribution actions. The gradient diversity penalty, in turn, encourages the ensemble members' Q-value gradients to diverge, further suppressing overestimation of out-of-distribution actions. Additionally, our method retains an adjustable behavior cloning (BC) term that pulls the actor toward dataset actions during early training and whose influence is gradually reduced as the Q-ensemble's estimates become more accurate. These components work synergistically to improve training stability and accuracy. Experimental results on the D4RL MuJoCo benchmarks demonstrate that our algorithm achieves superior convergence speed, stability, and performance compared to existing methods.
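To make the two regularizers concrete, the following is a minimal PyTorch sketch of an EDAC-style gradient diversity penalty over an ensemble of Q-networks, together with a pessimistic actor loss carrying an adjustable BC term. It is an illustration under assumptions, not the paper's implementation: the network sizes, the ensemble size, and the names `bc_weight`, `QEnsemble`, `gradient_diversity_penalty`, and `actor_loss` are hypothetical.

```python
# Minimal sketch of the two penalties described above. Assumed, not the
# paper's code: architecture, ensemble size, and coefficient names.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QEnsemble(nn.Module):
    """N independent Q-networks evaluated on the same (state, action) batch."""
    def __init__(self, state_dim, action_dim, n_nets=10, hidden=256):
        super().__init__()
        self.nets = nn.ModuleList(
            nn.Sequential(
                nn.Linear(state_dim + action_dim, hidden), nn.ReLU(),
                nn.Linear(hidden, hidden), nn.ReLU(),
                nn.Linear(hidden, 1),
            )
            for _ in range(n_nets)
        )

    def forward(self, state, action):
        sa = torch.cat([state, action], dim=-1)
        return torch.stack([net(sa) for net in self.nets], dim=0)  # (N, B, 1)

def gradient_diversity_penalty(q_ensemble, state, action):
    """EDAC-style penalty: average pairwise cosine similarity between the
    gradients of each Q_i with respect to the action. Minimizing it pushes
    the ensemble members' action gradients apart."""
    action = action.detach().requires_grad_(True)
    q_values = q_ensemble(state, action)                     # (N, B, 1)
    grads = []
    for i in range(q_values.shape[0]):
        (g,) = torch.autograd.grad(q_values[i].sum(), action,
                                   retain_graph=True, create_graph=True)
        grads.append(g)                                      # each (B, A)
    grads = F.normalize(torch.stack(grads, dim=0), dim=-1)   # (N, B, A)
    sim = torch.einsum('nba,mba->bnm', grads, grads)         # (B, N, N)
    n = grads.shape[0]
    off_diag = sim.sum(dim=(1, 2)) - sim.diagonal(dim1=1, dim2=2).sum(dim=1)
    return (off_diag / (n * (n - 1))).mean()

def actor_loss(q_ensemble, actor, state, dataset_action, bc_weight):
    """Pessimistic actor objective (min over the ensemble) plus an adjustable
    behavior-cloning term that pulls the policy toward dataset actions."""
    pi = actor(state)
    q_min = q_ensemble(state, pi).min(dim=0).values          # (B, 1)
    return -q_min.mean() + bc_weight * F.mse_loss(pi, dataset_action)
```

In a full training loop, the diversity penalty would be added to the ensemble's TD loss with a coefficient, and `bc_weight` would be decayed over training, mirroring the adjustable BC schedule described above.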