Batch reinforcement learning enables policy learning without direct interaction with the environment during training, relying exclusively on a previously collected dataset of interactions. This approach is therefore well-suited to high-risk and cost-intensive applications, such as industrial control. Learned policies are commonly restricted to act similarly to the behavior observed in the batch. In a real-world scenario, learned policies are deployed in the industrial system, inevitably leading to the collection of new data that can subsequently be appended to the existing dataset. The process of learning and deployment can thus take place multiple times throughout the lifespan of a system. In this work, we propose to exploit this iterative nature of applying offline reinforcement learning to guide learned policies towards efficient and informative data collection during deployment, leading to continuous improvement of learned policies while remaining within the support of the collected data. We present an algorithmic methodology for iterative batch reinforcement learning based on ensemble-based model-based policy search, augmented with a safety and, importantly, a diversity criterion.
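The iterative learn-and-deploy loop described above can be illustrated with a minimal sketch. Everything here is an assumption chosen for brevity, not the paper's actual algorithm: a toy one-dimensional linear system stands in for the industrial plant, the ensemble members are bootstrap-fitted linear dynamics models, the policy class is a scalar feedback gain selected by grid search, the diversity criterion is an ensemble-disagreement bonus, and the safety criterion is a penalty on actions outside the support of the collected data.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_step(s, u):
    """Toy 'real' plant (unknown to the learner): s' = 0.9 s + u + noise."""
    return 0.9 * s + u + 0.05 * rng.standard_normal()

def rollout(policy, s0=2.0, T=20):
    """Deploy a policy on the real system and record (s, u, s') transitions."""
    s, transitions = s0, []
    for _ in range(T):
        u = float(np.clip(policy(s), -1.0, 1.0))
        s_next = true_step(s, u)
        transitions.append((s, u, s_next))
        s = s_next
    return transitions

def fit_ensemble(data, n_models=5):
    """Bootstrap-fit an ensemble of linear models s' ~ a*s + b*u on the batch."""
    X = np.array([[s, u] for s, u, _ in data])
    y = np.array([sn for _, _, sn in data])
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(data), len(data))
        coef, *_ = np.linalg.lstsq(X[idx], y[idx], rcond=None)
        models.append(coef)
    return np.array(models)          # shape (n_models, 2)

def score(gain, models, u_low, u_high, s0=2.0, T=20, lam=0.1):
    """Predicted return + diversity bonus (ensemble disagreement)
    - safety penalty (actions outside the support of the batch)."""
    s = np.full(len(models), s0)     # one imagined state per ensemble member
    total = 0.0
    for _ in range(T):
        u = np.clip(-gain * s, -1.0, 1.0)
        penalty = np.maximum(0.0, u - u_high) + np.maximum(0.0, u_low - u)
        s = models[:, 0] * s + models[:, 1] * u
        total += (-s**2).mean() + lam * s.std() - 10.0 * penalty.mean()
    return total

# Initial batch from an exploratory (random-action) policy.
data = rollout(lambda s: rng.uniform(-1.0, 1.0))

# Iterate: learn offline on the batch, deploy, append the new data.
for _ in range(3):
    models = fit_ensemble(data)
    u_obs = np.array([u for _, u, _ in data])
    u_low, u_high = u_obs.min(), u_obs.max()
    gains = np.linspace(0.0, 1.5, 31)
    best = max(gains, key=lambda g: score(g, models, u_low, u_high))
    data += rollout(lambda s: -best * s)
```

Each pass through the loop mirrors one learn/deploy cycle from the abstract: the policy is optimized purely against the ensemble (offline), the diversity bonus steers deployment towards informative transitions where the models disagree, and the support penalty keeps the policy close to previously observed actions; the batch then grows with every deployment.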