Offline reinforcement learning provides a viable approach to obtaining advanced control strategies for dynamical systems, in particular when direct interaction with the environment is not available. In this paper, we introduce a conceptual extension for model-based policy search methods, called variable objective policy (VOP). With this approach, policies are trained to generalize efficiently over a variety of objectives, which parameterize the reward function. We demonstrate that by altering the objectives passed as input to the policy, users gain the freedom to adjust its behavior or re-balance optimization targets at runtime, without the need to collect additional observation batches or to re-train the policy.
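To make the idea of passing objectives to the policy concrete, the following is a minimal sketch and not the implementation used in this paper. It rests on several assumptions: a one-dimensional toy system, a hand-written linear `model_step` standing in for a dynamics model learned from offline data, an objective consisting of two weights (`w_track`, `w_effort`), and simple random hill-climbing in place of the actual policy search method. All function and variable names are hypothetical.

```python
import random

def reward(state, action, objective):
    """Reward parameterized by the objective (assumed: two trade-off weights)."""
    w_track, w_effort = objective
    return -(w_track * state ** 2 + w_effort * action ** 2)

def model_step(state, action):
    """Stand-in for a dynamics model learned from offline data (assumed linear)."""
    return 0.9 * state + 0.5 * action

def policy(params, state, objective):
    """Objective-conditioned policy: the feedback gain depends on the objective."""
    gain = params[0] + params[1] * objective[0] + params[2] * objective[1]
    return -gain * state

def rollout_return(params, objective, start_state=1.0, horizon=20):
    """Sum of rewards of a rollout through the model under a fixed objective."""
    state, total = start_state, 0.0
    for _ in range(horizon):
        action = policy(params, state, objective)
        total += reward(state, action, objective)
        state = model_step(state, action)
    return total

# Objectives used during training; the policy should work across all of them.
TRAIN_OBJECTIVES = [(w, 1.0 - w) for w in (0.1, 0.3, 0.5, 0.7, 0.9)]

def average_return(params):
    """Mean model-based return over the set of training objectives."""
    returns = [rollout_return(params, obj) for obj in TRAIN_OBJECTIVES]
    return sum(returns) / len(returns)

def train(n_iters=300, step=0.1, seed=0):
    """Random hill-climbing (a simple substitute for the policy search method)."""
    rng = random.Random(seed)
    params = [0.0, 0.0, 0.0]
    best = average_return(params)
    for _ in range(n_iters):
        candidate = [p + step * rng.gauss(0.0, 1.0) for p in params]
        score = average_return(candidate)
        if score > best:  # keep only candidates that improve the average return
            params, best = candidate, score
    return params

if __name__ == "__main__":
    params = train()
    # At runtime, the same trained policy is queried with different objectives.
    for objective in ((0.9, 0.1), (0.1, 0.9)):
        print(objective, policy(params, 1.0, objective))
```

In this sketch the policy parameters are fitted once against the model, averaged over several objectives. Changing the behavior afterwards only requires passing a different `objective` tuple to `policy`, with no further data collection or training.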