The generalization of deep neural networks (DNNs) is limited by the over-reliance of current offline reinforcement learning techniques on conservative processing of existing datasets. This conservatism frequently yields algorithms that settle for suboptimal solutions fitted only to a particular dataset. Similarly, in online reinforcement learning, the punitive pessimism imposed by earlier methods deprives the model of its exploratory potential. We propose a novel framework, Optimistic and Pessimistic Actor Reinforcement Learning (OPARL). OPARL employs a dual-actor approach: an optimistic actor dedicated to exploration and a pessimistic actor focused on exploitation, thereby explicitly separating the exploration and exploitation strategies. This combination fosters a more balanced and efficient approach: it optimizes policies toward high-reward actions through pessimistic exploitation while ensuring extensive state coverage through optimistic exploration. Experiments and theoretical analysis demonstrate that OPARL improves agents' capacity for both exploitation and exploration. On most tasks of the DMControl benchmark and the MuJoCo environments, OPARL outperforms state-of-the-art methods. Our code is released at https://github.com/yydsok/OPARL
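The dual-actor idea can be illustrated with a toy example. The abstract does not state how the optimistic and pessimistic estimates are computed, so the following is only a minimal sketch under stated assumptions: an ensemble of critics is assumed, the optimistic view takes the maximum and the pessimistic view the minimum across critics, and actions are discrete for simplicity. The function names and the value table are hypothetical and are not taken from the OPARL code.

```python
import numpy as np


def optimistic_value(q_estimates):
    """Upper estimate per action: max over the critic ensemble (assumption)."""
    return np.max(q_estimates, axis=0)


def pessimistic_value(q_estimates):
    """Lower estimate per action: min over the critic ensemble (assumption)."""
    return np.min(q_estimates, axis=0)


def select_action(q_estimates, explore):
    """Pick a discrete action index.

    explore=True  -> act greedily on the optimistic estimate (exploration)
    explore=False -> act greedily on the pessimistic estimate (exploitation)
    """
    if explore:
        values = optimistic_value(q_estimates)
    else:
        values = pessimistic_value(q_estimates)
    return int(np.argmax(values))


# Hypothetical Q-values: 3 critics (rows) x 4 actions (columns).
# The critics disagree strongly about action 2, i.e. it is uncertain.
q = np.array([
    [1.0, 0.5, 0.2, 0.9],
    [1.1, 0.4, 2.5, 0.8],
    [0.9, 0.6, -1.0, 1.0],
])

explore_action = select_action(q, explore=True)
exploit_action = select_action(q, explore=False)
print(explore_action, exploit_action)
```

In this toy table the optimistic selection is drawn to the action on which the critics disagree most, while the pessimistic selection prefers the action whose worst-case estimate is highest, which mirrors the separation of exploration and exploitation described above.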