Research on reinforcement learning (RL) can be broadly divided into two categories: online RL and offline RL. In both settings, work on the Bellman error has focused on optimization techniques and performance improvement rather than on the error's inherent structural properties, such as its distribution. In this study, we analyze the distribution of the Bellman approximation error in both online and offline settings. We find that in the online setting the Bellman error follows a Logistic distribution, while in the offline setting it follows a constrained Logistic distribution, where the constraint depends on the prior policy in the offline data set. Based on this finding, we replace MSELoss, which rests on the assumption that Bellman errors are normally distributed, with $\rm LLoss$, an alternative loss function constructed from the Logistic maximum likelihood function. In addition, we observe that the rewards in the offline data set should follow a specific distribution, which would facilitate achieving the offline objective. In our numerical experiments, we changed only the loss function of two Soft-Actor-Critic variants, holding all other components fixed, in both online and offline environments. The results confirmed our hypotheses for the online and offline settings; we also found that the variance of LLoss is smaller than that of MSELoss. Our research provides valuable insights for further investigations based on the distribution of Bellman errors.
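To make the contrast between the two loss functions concrete, the following is a minimal sketch of a loss derived from the Logistic maximum likelihood function, placed next to the usual mean squared error. It assumes the Bellman errors are modeled as a zero-location Logistic distribution with a fixed scale; the function names `logistic_nll_loss` and `mse_loss` and the parameter `sigma` are hypothetical and chosen for illustration, and the exact form of $\rm LLoss$ used in the paper may differ.

```python
import math


def logistic_nll_loss(errors, sigma=1.0):
    """Mean negative log-likelihood of errors under Logistic(0, sigma).

    The Logistic density is f(e) = exp(-z) / (sigma * (1 + exp(-z))**2)
    with z = e / sigma, so -log f(e) = z + log(sigma) + 2*log(1 + exp(-z)).
    Because the density is symmetric, this equals
    |z| + log(sigma) + 2*log(1 + exp(-|z|)), which avoids overflow.
    `sigma` is an assumed, fixed scale parameter (hypothetical name).
    """
    total = 0.0
    for e in errors:
        z = abs(e) / sigma
        total += z + math.log(sigma) + 2.0 * math.log1p(math.exp(-z))
    return total / len(errors)


def mse_loss(errors):
    """Mean squared error, the loss implied by a normal error model."""
    return sum(e * e for e in errors) / len(errors)


if __name__ == "__main__":
    # Illustrative Bellman errors (made-up values, including one outlier).
    bellman_errors = [0.1, -0.3, 0.05, 4.0]
    print("Logistic NLL:", logistic_nll_loss(bellman_errors))
    print("MSE:", mse_loss(bellman_errors))
```

Under this sketch the Logistic loss grows roughly linearly in the magnitude of a large error, whereas the squared error grows quadratically, so a single large Bellman error contributes far less to the Logistic loss than to MSELoss.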