Modern reinforcement learning (RL) can be categorized into online and offline variants. The Bellman equation is pivotal to both, yet current research on it centers on optimization techniques and performance enhancement rather than on the inherent structural properties of the Bellman error, such as its distribution. This study investigates the distribution of the Bellman approximation error through iterative exploration of the Bellman equation and observes that the error approximately follows a Logistic distribution. Based on this observation, we propose the Logistic maximum likelihood function (LLoss) as an alternative to the commonly used mean squared error (MSELoss), which assumes that Bellman errors follow a Normal distribution. We validate the hypotheses through extensive numerical experiments across diverse online and offline environments. In particular, we apply the Logistic correction to the loss functions of various RL baseline methods and observe that the results with LLoss consistently outperform their MSE counterparts. We also conduct Kolmogorov-Smirnov tests to confirm the reliability of the Logistic distribution. Moreover, our theory connects the Bellman error to the proportional reward scaling phenomenon by providing a distribution-based analysis. Furthermore, we apply a bias-variance decomposition to sampling from the Logistic distribution. The theoretical and empirical insights of this study lay a foundation for future investigations and enhancements centered on the distribution of the Bellman error.
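To make the contrast between the two losses concrete, the following is a minimal sketch of a Logistic negative log-likelihood next to the mean squared error. The abstract does not give the exact form of LLoss, so this uses the textbook Logistic density with location fixed at zero and a fixed scale; the names `logistic_nll`, `mse`, and `scale` are hypothetical and not taken from the paper.

```python
import math


def logistic_nll(errors, scale=1.0):
    """Mean negative log-likelihood of errors under Logistic(0, scale).

    Assumption: zero location and a fixed, user-chosen scale.
    """
    total = 0.0
    for e in errors:
        z = abs(e / scale)
        # -log f(e) = |z| + 2*log(1 + exp(-|z|)) + log(scale);
        # writing it in terms of |z| keeps exp() from overflowing.
        total += z + 2.0 * math.log1p(math.exp(-z))
    return total / len(errors) + math.log(scale)


def mse(errors):
    """Mean squared error, the Gaussian-likelihood counterpart."""
    return sum(e * e for e in errors) / len(errors)


if __name__ == "__main__":
    # Illustrative (made-up) Bellman errors, including one large outlier.
    bellman_errors = [0.1, -0.3, 0.2, 4.0]
    print("Logistic NLL:", logistic_nll(bellman_errors))
    print("MSE:         ", mse(bellman_errors))
```

Under these assumptions the Logistic loss grows roughly linearly in the magnitude of a large error, whereas the squared error grows quadratically, so a single outlier influences the MSE far more strongly.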