Std $Q$-target (SQT) is a conservative, ensemble, actor-critic, $Q$-learning-based algorithm built on a single key $Q$-formula: the $Q$-networks' standard deviation, an "uncertainty penalty" that serves as a minimalist solution to the overestimation-bias problem. We implement SQT on top of the TD3/TD7 code and test it against the state-of-the-art (SOTA) actor-critic algorithms DDPG, TD3, and TD7 on seven popular MuJoCo and Bullet tasks. Our results demonstrate the superiority of SQT's $Q$-target formula over TD3's $Q$-target formula as a conservative solution to overestimation bias in RL, with SQT showing a clear performance advantage by a wide margin over DDPG, TD3, and TD7 on all tasks.
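To make the "uncertainty penalty" idea concrete, here is a minimal sketch of a penalized target of the kind the abstract describes: the ensemble's standard deviation is subtracted from its mean estimate. The exact combination rule and the `penalty_coef` weight are assumptions for illustration, not the paper's definition.

```python
import numpy as np

def sqt_style_target(q_values, reward, discount, penalty_coef=0.5):
    """Hypothetical SQT-style Q-target sketch: penalize the ensemble
    mean by the standard deviation across the Q-networks.

    q_values: next-state Q-estimates from each network in the ensemble.
    penalty_coef: assumed weighting of the uncertainty penalty.
    """
    q_values = np.asarray(q_values, dtype=float)
    uncertainty_penalty = q_values.std()  # disagreement among Q-networks
    return reward + discount * (q_values.mean() - penalty_coef * uncertainty_penalty)
```

When the networks agree, the penalty vanishes and the target reduces to the usual bootstrapped estimate; the more they disagree, the more conservative the target becomes.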