Reward-Punishment Reinforcement Learning with Maximum Entropy

We introduce the ``soft Deep MaxPain'' (softDMP) algorithm, which integrates the optimization of long-term policy entropy into reward-punishment reinforcement learning objectives. Our motivation is to replace the hard ``max'' and ``min'' operators traditionally used to update action values with smoother operators, with the goal of improving sample efficiency and robustness. We also address two unresolved issues from the previous Deep MaxPain (DMP) method. First, we investigate how the negated (``flipped'') pain-seeking sub-policy, derived from the punishment action value, works together with the ``min'' operator to learn the punishment module effectively, and how softDMP's smooth learning operator sheds light on this ``flipping'' trick. Second, we tackle the problem of collecting data for learning the punishment module, so as to mitigate the inconsistencies that arise because the ``flipped'' (pain-avoidance) sub-policy takes part in the unified behavior policy. We study the first issue empirically in two discrete Markov decision process (MDP) environments, clarifying the key advances of the DMP approach and why the hard operators need soft counterparts. For the second issue, we propose a probabilistic classifier based on the ratio of the pain-seeking sub-policy to the sum of the pain-seeking and goal-reaching sub-policies. This classifier assigns roll-outs to separate replay buffers, which are used to update the reward and punishment action-value functions, respectively. Our framework demonstrates superior performance in Turtlebot 3 maze navigation tasks simulated in ROS Gazebo.
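
The two ingredients described above can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the smooth operators take the log-sum-exp form common in maximum-entropy reinforcement learning, and that the classifier routes each transition stochastically according to the stated ratio. The function names, the `temperature` parameter, the per-transition granularity, and the even split when both probabilities are zero are all hypothetical choices made for illustration.

```python
import math
import random


def soft_max(values, temperature=1.0):
    """Smooth stand-in for max: temperature * log-sum-exp(values / temperature).

    Assumption: the log-sum-exp form is used here for illustration only.
    """
    peak = max(values)  # subtract the peak for numerical stability
    total = sum(math.exp((v - peak) / temperature) for v in values)
    return peak + temperature * math.log(total)


def soft_min(values, temperature=1.0):
    """Smooth stand-in for min: negate the soft max of the negated values."""
    return -soft_max([-v for v in values], temperature)


def punishment_probability(pi_pain, pi_goal):
    """Ratio of the pain-seeking probability to the sum of both sub-policies."""
    total = pi_pain + pi_goal
    if total <= 0.0:
        return 0.5  # assumption: with no evidence either way, split evenly
    return pi_pain / total


def route_transition(transition, pi_pain, pi_goal,
                     reward_buffer, punishment_buffer, rng=random):
    """Stochastically assign one transition to one of the two replay buffers.

    pi_pain and pi_goal are the probabilities that the pain-seeking and
    goal-reaching sub-policies give to the action that was actually taken.
    """
    p = punishment_probability(pi_pain, pi_goal)
    if rng.random() < p:
        punishment_buffer.append(transition)
    else:
        reward_buffer.append(transition)
    return p
```

As `temperature` approaches zero, `soft_max` and `soft_min` approach the hard ``max'' and ``min'' operators, while larger values give smoother updates. A deterministic threshold on the ratio would be an equally plausible reading of the classifier; the stochastic version is shown only because the abstract calls it probabilistic.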
