Modern reinforcement learning (RL) can be categorized into online and offline variants. Although the Bellman equation is pivotal to both, current research on it revolves primarily around optimization techniques and performance enhancement rather than the inherent structural properties of the Bellman error, such as its distribution. This study investigates the distribution of the Bellman approximation error in both online and offline settings through iterative exploration of the Bellman equation. We observe that, in both online and offline RL, the Bellman error conforms to a Logistic distribution. Building on this finding, we employ the Logistic maximum likelihood function (LLoss) as an alternative to the commonly used MSE loss, which implicitly assumes that Bellman errors follow a normal distribution. We validate our hypotheses through extensive numerical experiments across diverse online and offline environments. In particular, we apply the Logistic correction to the loss function of various baseline algorithms and consistently observe that the corrected loss significantly outperforms its MSE counterpart. Additionally, we conduct Kolmogorov-Smirnov tests to confirm the reliability of the Logistic distribution. The theoretical and empirical insights of this study provide valuable groundwork for future investigations and enhancements centered on the distribution of Bellman errors.
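To make the contrast between the two loss functions concrete, the following is a minimal sketch rather than the authors' implementation. It assumes the textbook negative log-likelihood of a zero-centered Logistic distribution with a fixed scale; the abstract does not give the exact form of LLoss, so the function names, the `scale` parameter, and the synthetic residuals are all illustrative assumptions. The last helper computes a one-sample Kolmogorov-Smirnov statistic against a Logistic CDF, in the spirit of the goodness-of-fit check mentioned above.

```python
import numpy as np


def mse_loss(bellman_error):
    """Mean squared error; equivalent (up to constants) to a Gaussian likelihood."""
    e = np.asarray(bellman_error, dtype=float)
    return float(np.mean(e ** 2))


def logistic_nll(bellman_error, scale=1.0):
    """Mean negative log-likelihood under Logistic(0, scale).

    Assumed form (not taken from the paper):
        -log f(e) = z + 2*log(1 + exp(-z)) + log(scale),  z = e / scale
    np.logaddexp keeps the softplus term numerically stable.
    """
    z = np.asarray(bellman_error, dtype=float) / scale
    return float(np.mean(z + 2.0 * np.logaddexp(0.0, -z) + np.log(scale)))


def logistic_nll_grad(bellman_error, scale=1.0):
    """Per-sample gradient of the NLL w.r.t. the error: tanh(z / 2) / scale.

    Unlike the MSE gradient (2 * e), it is bounded by 1 / scale.
    """
    z = np.asarray(bellman_error, dtype=float) / scale
    return np.tanh(z / 2.0) / scale


def ks_statistic_logistic(samples, loc=0.0, scale=1.0):
    """One-sample KS statistic D against the Logistic(loc, scale) CDF."""
    x = np.sort(np.asarray(samples, dtype=float))
    n = x.size
    cdf = 0.5 * (1.0 + np.tanh((x - loc) / (2.0 * scale)))  # stable sigmoid
    upper = np.arange(1, n + 1) / n - cdf   # ECDF just after each point
    lower = cdf - np.arange(0, n) / n       # ECDF just before each point
    return float(max(upper.max(), lower.max()))


# Synthetic stand-in for Bellman errors (hypothetical data, not from the paper).
rng = np.random.default_rng(0)
errors = rng.logistic(loc=0.0, scale=0.5, size=10_000)

print("MSE loss:     ", mse_loss(errors))
print("Logistic NLL: ", logistic_nll(errors, scale=0.5))
print("KS statistic: ", ks_statistic_logistic(errors, scale=0.5))
```

One reason the choice matters under these assumptions: the gradient of the Logistic negative log-likelihood saturates for large errors, whereas the MSE gradient grows linearly with the error, so large Bellman errors influence the update less under the Logistic loss.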