Natural agents can effectively learn from multiple data sources that differ in size, quality, and types of measurements. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing (state, action, reward) triplets at every timestep, and unlabelled trajectories that contain only state and reward information. For this setting, we develop and study a simple meta-algorithmic pipeline that learns an inverse dynamics model on the labelled data to obtain proxy labels for the unlabelled data, and then applies any offline RL algorithm to the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful: on several D4RL benchmarks, certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when we label only 10% of the trajectories, drawn from the low-return regime. To strengthen our understanding, we perform a large-scale controlled empirical study of how data-centric properties of the labelled and unlabelled datasets interact with algorithmic design choices (e.g., choice of inverse dynamics model and offline RL algorithm), in order to identify general trends and best practices for training RL agents on semi-supervised offline datasets.
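The proxy-labelling step of the pipeline can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration, not the paper's implementation: the toy linear dynamics s' = s + B a, the use of a least-squares regressor in place of a learned neural inverse dynamics model, and all function and variable names (`make_transitions`, `fit_inverse_dynamics`, `proxy_label`). Rewards are omitted because they pass through the labelling step unchanged.

```python
import numpy as np

def make_transitions(n, B, rng):
    """Sample toy transitions from s' = s + B a (illustrative assumption)."""
    states = rng.normal(size=(n, B.shape[0]))
    actions = rng.uniform(-1.0, 1.0, size=(n, B.shape[1]))
    next_states = states + actions @ B.T
    return states, actions, next_states

def features(states, next_states):
    """Inverse dynamics input: current state, next state, and a bias term."""
    return np.hstack([states, next_states, np.ones((len(states), 1))])

def fit_inverse_dynamics(states, next_states, actions):
    """Least-squares fit of a linear map (s, s') -> a on labelled data."""
    W, *_ = np.linalg.lstsq(features(states, next_states), actions, rcond=None)
    return W

def proxy_label(W, states, next_states):
    """Predict proxy actions for transitions whose actions are missing."""
    return features(states, next_states) @ W

rng = np.random.default_rng(0)
B = rng.normal(size=(4, 2))  # hypothetical 4-dim state, 2-dim action

# Small labelled set (actions known) and larger unlabelled set (actions hidden).
s_lab, a_lab, sn_lab = make_transitions(200, B, rng)
s_unl, a_true, sn_unl = make_transitions(2000, B, rng)

W = fit_inverse_dynamics(s_lab, sn_lab, a_lab)
a_proxy = proxy_label(W, s_unl, sn_unl)

# a_true is used only to check the sketch; it is unavailable in the real setting.
proxy_mse = float(np.mean((a_proxy - a_true) ** 2))

# Union of true and proxy-labelled data, ready for any offline RL algorithm.
combined_states = np.concatenate([s_lab, s_unl])
combined_actions = np.concatenate([a_lab, a_proxy])
```

In this noise-free linear toy problem the actions are exactly recoverable from consecutive states, so the proxy labels are essentially perfect. On real benchmarks the inverse dynamics model would be approximate, and the quality of its labels is one of the factors the empirical study examines.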