Offline reinforcement learning (RL) has received rising interest due to its appealing data efficiency. The present study addresses behavior estimation, a task that lays the foundation of many offline RL algorithms. Behavior estimation aims to estimate the policy that generated the training data. In particular, this work considers a scenario where the data are collected from multiple sources. In this case, existing approaches to behavior estimation neglect data heterogeneity and therefore suffer from behavior misspecification. To overcome this drawback, the present study proposes a latent variable model that infers a set of policies from data, which allows an agent to use, as the behavior policy for a particular trajectory, the policy that best describes that trajectory. This model provides the agent with a fine-grained characterization of multi-source data and helps it overcome behavior misspecification. This work also proposes a learning algorithm for this model and illustrates its practical usage by extending an existing offline RL algorithm. Lastly, extensive evaluation confirms both the existence of behavior misspecification and the efficacy of the proposed model.
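To make the idea of inferring a set of policies from multi-source data concrete, the following is a minimal sketch and not the model or learning algorithm proposed in this work. It assumes a tabular setting with discrete states and actions, one latent policy index per trajectory, and a plain expectation-maximization (EM) fit of a mixture of policies. The function name `fit_policy_mixture`, its parameters, the smoothing constant, and the toy data are all hypothetical.

```python
import numpy as np

def fit_policy_mixture(trajectories, n_states, n_actions, n_policies,
                       init_policies=None, n_iters=50, seed=0):
    """EM for a mixture of tabular policies (illustrative assumption only).

    trajectories: list of trajectories, each a list of (state, action) pairs.
    Returns (policies, weights, resp), where policies[k, s, a] is the
    probability that policy k takes action a in state s, and resp[i, k] is
    the posterior probability that trajectory i was generated by policy k.
    """
    rng = np.random.default_rng(seed)
    if init_policies is None:
        # Random initialization: one action distribution per (policy, state).
        policies = rng.dirichlet(np.ones(n_actions), size=(n_policies, n_states))
    else:
        policies = np.asarray(init_policies, dtype=float)
    weights = np.full(n_policies, 1.0 / n_policies)

    # counts[i, s, a] = number of times trajectory i takes action a in state s.
    counts = np.zeros((len(trajectories), n_states, n_actions))
    for i, traj in enumerate(trajectories):
        for s, a in traj:
            counts[i, s, a] += 1.0

    eps = 1e-3  # smoothing keeps every probability strictly positive
    for _ in range(n_iters):
        # E-step: posterior over the latent policy index of each trajectory.
        log_post = np.einsum("isa,ksa->ik", counts, np.log(policies))
        log_post += np.log(weights)
        log_post -= log_post.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(log_post)
        resp /= resp.sum(axis=1, keepdims=True)

        # M-step: responsibility-weighted action counts, then normalize.
        weighted = np.einsum("ik,isa->ksa", resp, counts) + eps
        policies = weighted / weighted.sum(axis=2, keepdims=True)
        weights = (resp.sum(axis=0) + eps) / (resp.shape[0] + n_policies * eps)
    return policies, weights, resp

# Toy multi-source data: source A always takes action 0, source B action 1.
source_a = [[(0, 0), (1, 0), (2, 0)] for _ in range(10)]
source_b = [[(0, 1), (1, 1), (2, 1)] for _ in range(10)]
init = np.stack([np.tile([0.6, 0.4], (3, 1)), np.tile([0.4, 0.6], (3, 1))])

policies, weights, resp = fit_policy_mixture(
    source_a + source_b, n_states=3, n_actions=2, n_policies=2,
    init_policies=init)

# The behavior policy for each trajectory is the one that best describes it.
best_policy = resp.argmax(axis=1)
```

In this sketch, a single policy fit to the pooled data would assign probability of about 0.5 to each action in every state, which describes neither source. The per-trajectory assignment in `best_policy` is what lets a downstream offline RL method use a trajectory-specific behavior policy instead.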