Reinforcement learning agents can achieve superhuman performance in static tasks but are costly to train and fragile to task changes. This limits their deployment in real-world scenarios where training experience is expensive or the context changes through factors such as sensor degradation, environmental processes, or shifting mission priorities. Lifelong reinforcement learning aims to improve sample efficiency and adaptability by studying how agents perform in evolving problems. However, the difficulty that these changes pose to an agent is rarely measured directly; agent performance can be compared across a change, but doing so is often prohibitively expensive. We propose Change-Induced Regret Proxy (CHIRP) metrics, a class of metrics that approximate a change's difficulty while avoiding the high cost of evaluating trained agents. We identify a relationship between a CHIRP metric and agent performance in two environments: a simple grid world and MetaWorld's suite of robotic arm tasks. We then demonstrate two uses for these metrics: for learning, an agent that clusters MDPs based on a CHIRP metric achieves $17\%$ higher average returns than three existing agents across a sequence of MetaWorld tasks; and for comparison, we show how a CHIRP can be calibrated to compare the difficulty of changes across distinctly different environments.