Many applications of Reinforcement Learning (RL) involve noise or stochasticity in the environment. Beyond their impact on learning, these uncertainties cause the exact same policy to perform differently, i.e. to yield different returns, from one roll-out to another. Common evaluation procedures in RL summarise the resulting return distributions using only the expected return, which does not account for the spread of the distribution. We refer to this spread as policy reproducibility: the ability of a policy to obtain similar performance when rolled out many times, a crucial property in some real-world applications. We highlight that existing procedures relying only on the expected return are limited on two fronts. First, an infinite number of return distributions with a wide range of performance-reproducibility trade-offs can share the same expected return, which limits its usefulness for comparing policies. Second, the expected return leaves no room for practitioners to choose the trade-off best suited to their application. In this work, we address these limitations by recommending the use of the Lower Confidence Bound, a metric taken from Bayesian optimisation that gives the user a preference parameter for choosing a desired performance-reproducibility trade-off. We also formalise and quantify policy reproducibility, and demonstrate the benefit of our metrics through extensive experiments with popular RL algorithms on common uncertain RL tasks.
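As a rough illustration of the trade-off described above, the following minimal sketch computes a Lower Confidence Bound over the returns of repeated roll-outs. It assumes the form commonly used in Bayesian optimisation, mean minus a preference parameter times the standard deviation; the exact estimators used in this work may differ. The function name `lower_confidence_bound`, the parameter `alpha`, and the two return arrays are hypothetical and chosen only for illustration.

```python
import numpy as np


def lower_confidence_bound(returns, alpha=1.0):
    """Assumed form: mean(returns) - alpha * std(returns).

    alpha = 0 recovers the plain expected return; a larger alpha
    penalises the spread of the return distribution more strongly.
    """
    returns = np.asarray(returns, dtype=float)
    return returns.mean() - alpha * returns.std()


# Hypothetical returns from four roll-outs of two fixed policies.
# Both have the same mean return (100) but very different spread.
steady = np.array([95.0, 100.0, 105.0, 100.0])
erratic = np.array([0.0, 200.0, 50.0, 150.0])

for alpha in (0.0, 1.0, 2.0):
    print(
        f"alpha={alpha}: "
        f"steady={lower_confidence_bound(steady, alpha):.2f}, "
        f"erratic={lower_confidence_bound(erratic, alpha):.2f}"
    )
```

With `alpha = 0` the two hypothetical policies are indistinguishable, which mirrors the first limitation of the expected return. Any positive `alpha` ranks the low-spread policy higher, and the size of `alpha` encodes how much the practitioner values reproducibility over raw performance.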