Reinforcement learning of real-world tasks is very data-inefficient, and extensive simulation-based modelling has become the dominant approach for training systems. However, in human-robot interaction and many other real-world settings, no single model fits all cases, because individual instances of the system differ (e.g. different people) or because the simulation models rely on necessary oversimplifications. This leaves two options: (1) learning the individual system's dynamics approximately from data, which requires data-intensive training, or (2) using a complete digital twin of each instance, which may not be realisable in many cases. We introduce two adjustment methods, co-kriging adjustment (CKA) and ridge regression adjustment (RRA), as novel ways to combine the advantages of both options. Both methods are based on an auto-regressive AR1 co-kriging model that we integrate with Gaussian process (GP) priors. This yields a data- and simulation-efficient way of using simplistic simulation models (e.g., a simple two-link model) and rapidly adapting them to individual instances (e.g., the biomechanics of individual people). Using CKA and RRA, we obtain more accurate uncertainty quantification of the entire system's dynamics than pure GP-based and AR1 methods. We demonstrate the efficiency of co-kriging adjustment with an interpretable reinforcement learning control example: learning to control a biomechanical human arm using only a two-link arm simulation model (offline part) and CKA derived from a small amount of interaction data (online, on the fly). Our method unlocks an efficient and uncertainty-aware way to implement reinforcement learning in real-world complex systems for which only imperfect simulation models exist.
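For readers unfamiliar with the AR1 co-kriging model that the abstract builds on, the following is a minimal sketch of the standard two-fidelity scheme, in which the real system is modelled as a scaled simulator output plus a GP discrepancy, f_hi(x) = rho * f_lo(x) + delta(x). It is not the paper's CKA or RRA method. The functions `simulator` and `real_system`, the kernel hyperparameters, and the least-squares estimate of `rho` are all assumptions chosen purely for illustration.

```python
import numpy as np

def rbf(a, b, length=0.2, var=1.0):
    """Squared-exponential kernel between 1-D input arrays a and b."""
    d = a[:, None] - b[None, :]
    return var * np.exp(-0.5 * (d / length) ** 2)

def gp_fit(x, y, length=0.2, var=1.0, noise=1e-4):
    """Fit a zero-mean GP; return a function giving posterior mean and variance."""
    K = rbf(x, x, length, var) + noise * np.eye(len(x))
    L = np.linalg.cholesky(K)
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y))

    def predict(xs):
        Ks = rbf(xs, x, length, var)
        v = np.linalg.solve(L, Ks.T)
        mean = Ks @ alpha
        variance = np.maximum(var - np.sum(v ** 2, axis=0), 0.0)
        return mean, variance

    return predict

# Hypothetical stand-ins: a cheap, imperfect simulator and the real system.
def simulator(x):
    return np.sin(2 * np.pi * x)

def real_system(x):
    return 1.5 * np.sin(2 * np.pi * x) + 0.3 * x

# Many cheap simulation samples, few expensive real-system samples.
x_lo = np.linspace(0.0, 1.0, 25)
x_hi = np.array([0.0, 0.25, 0.5, 0.75, 1.0])
y_hi = real_system(x_hi)

# Step 1: GP on the low-fidelity (simulation) data.
gp_lo = gp_fit(x_lo, simulator(x_lo))

# Step 2: estimate the scaling rho by least squares (an illustrative choice).
m_lo_at_hi, _ = gp_lo(x_hi)
rho = float(m_lo_at_hi @ y_hi / (m_lo_at_hi @ m_lo_at_hi))

# Step 3: GP on the discrepancy delta(x) = f_hi(x) - rho * f_lo(x).
gp_delta = gp_fit(x_hi, y_hi - rho * m_lo_at_hi)

def predict_hi(xs):
    """AR1 prediction: scaled simulator GP plus discrepancy GP."""
    m_lo, v_lo = gp_lo(xs)
    m_d, v_d = gp_delta(xs)
    return rho * m_lo + m_d, rho ** 2 * v_lo + v_d

x_test = np.linspace(0.0, 1.0, 101)
mean_hi, var_hi = predict_hi(x_test)
rmse_ck = float(np.sqrt(np.mean((mean_hi - real_system(x_test)) ** 2)))
rmse_sim = float(np.sqrt(np.mean((simulator(x_test) - real_system(x_test)) ** 2)))
print(f"rho={rho:.3f}  RMSE co-kriging={rmse_ck:.4f}  RMSE simulator only={rmse_sim:.4f}")
```

In this toy setting, five real-system samples are enough to correct the simulator because the simulator GP already carries most of the structure. The predictive variance combines the uncertainty of both GPs and ignores uncertainty in `rho`, which is a simplification.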