This paper proposes a suite of rationality measures, and an associated theory, for reinforcement learning agents, a property that is increasingly critical yet rarely explored. We define an action taken in deployment to be perfectly rational if it maximises the hidden true value function along the steepest direction. The expected value discrepancy between a policy's actions and their rational counterparts, accumulated over the deployment trajectory, is defined to be the expected rational risk; an empirical-average counterpart over training is also defined. Their difference, termed the rational risk gap, is decomposed into (1) an extrinsic component caused by environment shifts between training and deployment, and (2) an intrinsic component determined by the algorithm's generalisability in a dynamic environment. These components are upper bounded, respectively, by (1) the $1$-Wasserstein distance between the transition kernels and initial state distributions of training and deployment, and (2) the empirical Rademacher complexity of the value function class. Our theory suggests hypotheses on the benefits of regularisers (including layer normalisation, $\ell_2$ regularisation, and weight normalisation) and domain randomisation, as well as the harm caused by environment shifts. Experiments are in full agreement with these hypotheses. The code is available at https://github.com/EVIEHub/Rationality.
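As a rough illustration of the quantity appearing in the extrinsic bound, the sketch below estimates the empirical $1$-Wasserstein distance between hypothetical one-dimensional initial-state samples drawn from a "training" and a "deployment" distribution. It is a minimal sketch, not the paper's implementation: the distributions, sample sizes, and variable names are illustrative assumptions, and `scipy.stats.wasserstein_distance` is used for the one-dimensional case only.

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)

# Hypothetical 1-D initial-state samples: the deployment environment is
# shifted relative to training (a mean shift of 0.5, equal variance).
train_states = rng.normal(loc=0.0, scale=1.0, size=5000)
deploy_states = rng.normal(loc=0.5, scale=1.0, size=5000)

# Empirical 1-Wasserstein distance between the two sample distributions.
# For a pure mean shift with equal variance, it is close to the shift itself.
w1 = wasserstein_distance(train_states, deploy_states)
print(f"empirical W1 distance: {w1:.3f}")
```

A larger shift between the training and deployment distributions yields a larger $W_1$ value, which in the paper's theory loosens the upper bound on the extrinsic component of the rational risk gap.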