Offline reinforcement learning (RL) has received rising interest due to its appealing data efficiency. The present study addresses behavior estimation, a task that lays the foundation of many offline RL algorithms. Behavior estimation aims at estimating the policy with which training data are generated. In particular, this work considers a scenario where the data are collected from multiple sources. In this case, existing approaches for behavior estimation neglect data heterogeneity and therefore suffer from behavior misspecification. To overcome this drawback, the present study proposes a latent variable model to infer a set of policies from data, which allows an agent to use as behavior policy the policy that best describes a particular trajectory. This model provides the agent with a fine-grained characterization of multi-source data and helps it overcome behavior misspecification. This work also proposes a learning algorithm for this model and illustrates its practical usage by extending an existing offline RL algorithm. Lastly, extensive evaluation confirms the existence of behavior misspecification and the efficacy of the proposed model.
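The idea of inferring a set of policies and picking the one that best describes a trajectory can be illustrated with a minimal sketch. It assumes tabular states and actions and a plain mixture of K policies fit with expectation-maximization, with one latent policy index per trajectory. The function names, the smoothing constant, and the EM procedure itself are illustrative assumptions, not the model or learning algorithm proposed in this work.

```python
import math
import random


def fit_policy_mixture(trajectories, n_states, n_actions, n_policies,
                       n_iters=50, smoothing=1e-2, seed=0):
    """Fit a mixture of tabular policies with EM (illustrative assumption).

    Each trajectory is a list of (state, action) index pairs and is assumed
    to be generated by exactly one of the n_policies latent policies.
    Returns (policies, weights, responsibilities).
    """
    rng = random.Random(seed)
    # policies[k][s] is a probability distribution over actions.
    policies = []
    for _ in range(n_policies):
        table = []
        for _ in range(n_states):
            row = [rng.random() + 0.5 for _ in range(n_actions)]
            total = sum(row)
            table.append([p / total for p in row])
        policies.append(table)
    weights = [1.0 / n_policies] * n_policies
    resp = []

    for _ in range(n_iters):
        # E-step: posterior over the latent policy index of each trajectory.
        resp = []
        for traj in trajectories:
            log_post = [
                math.log(weights[k])
                + sum(math.log(policies[k][s][a]) for s, a in traj)
                for k in range(n_policies)
            ]
            top = max(log_post)  # subtract the max for numerical stability
            unnorm = [math.exp(lp - top) for lp in log_post]
            z = sum(unnorm)
            resp.append([u / z for u in unnorm])

        # M-step: smoothed mixture weights and action frequencies.
        denom = len(trajectories) + n_policies * smoothing
        weights = [
            (sum(r[k] for r in resp) + smoothing) / denom
            for k in range(n_policies)
        ]
        for k in range(n_policies):
            counts = [[smoothing] * n_actions for _ in range(n_states)]
            for r, traj in zip(resp, trajectories):
                for s, a in traj:
                    counts[s][a] += r[k]
            policies[k] = [[c / sum(row) for c in row] for row in counts]

    return policies, weights, resp


def most_likely_policy(responsibilities):
    """Index of the policy that best explains each trajectory."""
    return [max(range(len(r)), key=lambda k: r[k]) for r in responsibilities]


if __name__ == "__main__":
    # Two hypothetical data sources in a one-state problem with two actions.
    data = [[(0, 0)] * 5 for _ in range(3)] + [[(0, 1)] * 5 for _ in range(3)]
    _, _, resp = fit_policy_mixture(data, n_states=1, n_actions=2, n_policies=2)
    print(most_likely_policy(resp))
```

In this sketch, an offline RL algorithm that needs a behavior policy for a given trajectory would use policies[k] for the index k returned by most_likely_policy, rather than a single policy fit to the pooled data.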