The generalization of deep neural networks (DNNs) is limited by the over-reliance of current offline reinforcement learning techniques on conservative processing of existing datasets. This conservatism frequently yields algorithms that settle for suboptimal solutions fitted only to a particular dataset. Similarly, in online reinforcement learning, the punitive pessimism imposed by prior methods deprives the model of its exploratory potential. We propose a novel framework, Optimistic and Pessimistic Actor Reinforcement Learning (OPARL). OPARL employs a dual-actor approach: an optimistic actor dedicated to exploration and a pessimistic actor focused on exploitation, thereby explicitly separating the exploration and exploitation strategies. This combination yields a more balanced and efficient method: it enables the optimization of policies that focus on high-reward actions through pessimistic exploitation, while ensuring extensive state coverage through optimistic exploration. Experiments and theoretical analysis demonstrate that OPARL improves agents' capacity for both exploitation and exploration. On most tasks in the DMControl benchmark and MuJoCo environments, OPARL outperforms state-of-the-art methods. Our code is released at https://github.com/yydsok/OPARL
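To make the dual-actor idea concrete, the following is a minimal toy sketch, not the authors' implementation. The abstract does not state how optimism and pessimism are computed, so the sketch assumes an ensemble of critics, takes the maximum across critics as the optimistic estimate and the minimum as the pessimistic estimate, and uses a small discrete set of candidate actions. All function names and Q-values here are hypothetical.

```python
import numpy as np

def optimistic_value(q_ensemble):
    """Upper estimate per action: max across critics (assumed form of optimism)."""
    return np.max(q_ensemble, axis=0)

def pessimistic_value(q_ensemble):
    """Lower estimate per action: min across critics (assumed form of pessimism)."""
    return np.min(q_ensemble, axis=0)

def select_action(q_ensemble, explore):
    """Return the index of the chosen candidate action.

    explore=True  -> optimistic actor (used while collecting data)
    explore=False -> pessimistic actor (used when acting greedily)
    """
    if explore:
        values = optimistic_value(q_ensemble)
    else:
        values = pessimistic_value(q_ensemble)
    return int(np.argmax(values))

# Hypothetical Q estimates: rows are 2 critics, columns are 3 candidate actions.
# The critics disagree strongly about action 0 and agree that action 2 is good.
q_ensemble = np.array([[1.0, 0.2, 0.9],
                       [0.1, 0.3, 0.8]])

explore_action = select_action(q_ensemble, explore=True)   # optimistic actor
exploit_action = select_action(q_ensemble, explore=False)  # pessimistic actor
print(explore_action, exploit_action)
```

With these made-up values, the optimistic actor picks action 0, where the critics disagree most and more data would be informative, while the pessimistic actor picks action 2, whose value every critic rates highly. This mirrors the separation the abstract describes: state coverage comes from the optimistic actor, and high-reward behavior comes from the pessimistic one.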