Natural agents can effectively learn from multiple data sources that differ in size, quality, and type of measurement. We study this heterogeneity in the context of offline reinforcement learning (RL) by introducing a new, practically motivated semi-supervised setting. Here, an agent has access to two sets of trajectories: labelled trajectories containing a state, action, and reward triplet at every timestep, and unlabelled trajectories that contain only state and reward information. For this setting, we develop and study a simple meta-algorithmic pipeline that learns an inverse dynamics model on the labelled data to obtain proxy labels for the unlabelled data, and then applies any offline RL algorithm to the true and proxy-labelled trajectories. Empirically, we find this simple pipeline to be highly successful: on several D4RL benchmarks~\cite{fu2020d4rl}, certain offline RL algorithms can match the performance of variants trained on a fully labelled dataset even when only 10\% of the trajectories are labelled and those trajectories are highly suboptimal. To strengthen our understanding, we perform a large-scale controlled empirical study investigating the interplay between data-centric properties of the labelled and unlabelled datasets and algorithmic design choices (e.g., the choice of inverse dynamics model and offline RL algorithm), in order to identify general trends and best practices for training RL agents on semi-supervised offline datasets.
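The proxy-labelling step of the pipeline described above can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the toy dynamics $s' = s + a$, the linear ridge-regression inverse dynamics model (a learned neural model would be the more likely choice in practice), and all function and variable names. The sketch fits a map $(s, s') \mapsto a$ on a small labelled set, predicts actions for a larger action-free set, and merges both into one dataset that could be passed to any offline RL algorithm.

```python
import numpy as np


def _features(states, next_states):
    # Concatenate (s, s') and append a bias column.
    ones = np.ones((len(states), 1))
    return np.hstack([states, next_states, ones])


def fit_inverse_dynamics(states, next_states, actions, ridge=1e-6):
    """Fit a linear inverse dynamics model (s, s') -> a by ridge regression.

    Hypothetical stand-in for a learned inverse dynamics model.
    """
    X = _features(states, next_states)
    gram = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(gram, X.T @ actions)


def proxy_label(weights, states, next_states):
    """Predict proxy actions for action-free transitions."""
    return _features(states, next_states) @ weights


def make_toy_transitions(n, rng, dim=3):
    # Toy dynamics (assumption for illustration only): s' = s + a.
    states = rng.normal(size=(n, dim))
    actions = rng.normal(size=(n, dim))
    rewards = -np.linalg.norm(states + actions, axis=1)
    return states, actions, rewards, states + actions


rng = np.random.default_rng(0)

# Small labelled set: states, actions, and rewards are all observed.
s_lab, a_lab, r_lab, s2_lab = make_toy_transitions(200, rng)

# Larger unlabelled set: the true actions are held out, used only for checking.
s_unl, a_hidden, r_unl, s2_unl = make_toy_transitions(1000, rng)

# Step 1: learn the inverse dynamics model on the labelled data.
weights = fit_inverse_dynamics(s_lab, s2_lab, a_lab)

# Step 2: obtain proxy labels for the unlabelled data.
a_proxy = proxy_label(weights, s_unl, s2_unl)

# Step 3: merge true and proxy-labelled transitions for an offline RL algorithm.
dataset = {
    "states": np.concatenate([s_lab, s_unl]),
    "actions": np.concatenate([a_lab, a_proxy]),
    "rewards": np.concatenate([r_lab, r_unl]),
    "next_states": np.concatenate([s2_lab, s2_unl]),
}

# Largest absolute error between proxy and held-out true actions.
max_proxy_error = np.abs(a_proxy - a_hidden).max()
```

Because the toy dynamics are exactly linear, the proxy actions recover the held-out actions almost perfectly; with real, nonlinear dynamics the proxy labels would be noisier, which is the regime the empirical study in the abstract is concerned with.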