Reinforcement learning (RL) has shown promise for decision-making tasks in real-world applications. One practical framework trains parameterized policy models on an offline dataset and then deploys them in an online environment. This approach is risky: offline training may be imperfect, so the deployed RL models can perform poorly and may even take dangerous actions. To address this issue, we propose an alternative framework in which a human supervises the RL models and provides additional feedback during the online deployment phase. We formalize this online deployment problem and develop two approaches. The first combines model selection with the upper confidence bound algorithm to adaptively choose which model to deploy from a candidate set of trained offline RL models. The second fine-tunes the model during online deployment whenever a supervision signal arrives. Empirical validation on robot locomotion control and traffic light control tasks demonstrates the effectiveness of both approaches.
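The first approach, adaptive selection among candidate offline RL models, can be illustrated with a small bandit loop. This is only a minimal sketch under stated assumptions: it uses the standard UCB1 rule as the upper confidence bound variant and treats human supervision as a scalar score in [0, 1] per deployment round, neither of which is specified in the abstract. The names `ucb_select` and `run_model_selection`, and the feedback callables standing in for "deploy model i and collect the supervisor's score", are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple


def ucb_select(counts: Sequence[int], means: Sequence[float], t: int,
               c: float = 2.0) -> int:
    """Return the index of the candidate model with the highest UCB1 score."""
    # Deploy every candidate once before comparing confidence bounds.
    for i, n in enumerate(counts):
        if n == 0:
            return i
    # Empirical mean plus an exploration bonus that shrinks as a model
    # is deployed more often.
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=lambda i: scores[i])


def run_model_selection(feedback_fns: Sequence[Callable[[], float]],
                        rounds: int) -> Tuple[List[int], List[float]]:
    """Run UCB1 over candidate models for a fixed number of deployment rounds.

    feedback_fns[i]() is a hypothetical stand-in for deploying model i for
    one round and returning the supervisor's feedback in [0, 1].
    """
    k = len(feedback_fns)
    counts = [0] * k      # how many rounds each model was deployed
    means = [0.0] * k     # running mean of feedback per model
    for t in range(1, rounds + 1):
        i = ucb_select(counts, means, t)
        reward = feedback_fns[i]()
        counts[i] += 1
        means[i] += (reward - means[i]) / counts[i]  # incremental mean update
    return counts, means


if __name__ == "__main__":
    rng = random.Random(0)
    # Three simulated candidates whose feedback is Bernoulli with these
    # (assumed, illustrative) success probabilities.
    probs = [0.2, 0.5, 0.8]
    candidates = [lambda p=p: float(rng.random() < p) for p in probs]
    counts, means = run_model_selection(candidates, rounds=500)
    print("deployments per model:", counts)
    print("estimated feedback per model:", [round(m, 2) for m in means])
```

In this sketch the candidate with the highest average feedback ends up deployed in most rounds, while weaker candidates are still tried occasionally so that an early unlucky estimate can be corrected.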