Unsupervised and self-supervised objectives, such as next-token prediction, have enabled the pre-training of generalist models from large amounts of unlabeled data. In reinforcement learning (RL), however, finding a truly general and scalable unsupervised pre-training objective for generalist policies from offline data remains a major open question. While a number of methods have been proposed to enable generic self-supervised RL, based on principles such as goal-conditioned RL, behavioral cloning, and unsupervised skill learning, such methods remain limited in terms of the diversity of the discovered behaviors, the need for high-quality demonstration data, or the lack of a clear prompting or adaptation mechanism for downstream tasks. In this work, we propose a novel unsupervised framework to pre-train generalist policies that capture diverse, optimal, long-horizon behaviors from unlabeled offline data such that they can be quickly adapted to arbitrary new tasks in a zero-shot manner. Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment, and then to span this learned latent space with directional movements, which enables various zero-shot policy "prompting" schemes for downstream tasks. Through experiments on simulated robotic locomotion and manipulation benchmarks, we show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion, often even outperforming prior methods designed specifically for each setting. Our code and videos are available at https://seohong.me/projects/hilp/
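To make the key insight concrete, here is a minimal sketch, not the authors' released implementation, of the two mechanisms the abstract describes: an intrinsic reward for moving along a chosen direction z in the learned latent space, and a zero-shot goal "prompt" that points z from the current state's latent toward a goal's latent. The encoder phi, its architecture, the dimensions, and all function names are illustrative assumptions; in particular, phi is assumed to have been trained separately so that latent distances track temporal distances in the environment.

```python
import torch
import torch.nn as nn

obs_dim, latent_dim = 17, 32  # hypothetical sizes

# Assumption: phi is pre-trained elsewhere so that ||phi(s) - phi(g)||
# approximates the minimum number of environment steps from s to g
# (the "temporal structure" the abstract refers to).
phi = nn.Sequential(nn.Linear(obs_dim, 256), nn.ReLU(), nn.Linear(256, latent_dim))

def directional_reward(s, s_next, z):
    # Intrinsic reward for a latent-conditioned policy pi(a | s, z):
    # the projection of the latent displacement phi(s') - phi(s) onto the
    # unit direction z, so the policy is rewarded for moving along z.
    with torch.no_grad():
        delta = phi(s_next) - phi(s)
    return (delta * z).sum(-1)

def goal_prompt(s, g):
    # Zero-shot goal "prompting": choose z as the unit vector pointing from
    # the current latent toward the goal latent; conditioning the pre-trained
    # policy on this z steers it toward g without any fine-tuning.
    with torch.no_grad():
        d = phi(g) - phi(s)
    return d / d.norm(dim=-1, keepdim=True).clamp_min(1e-8)

# During unsupervised pre-training, z would instead be sampled uniformly
# from the unit sphere so the policy learns to span all latent directions.
s, s_next, g = (torch.randn(4, obs_dim) for _ in range(3))
z = goal_prompt(s, g)
print(directional_reward(s, s_next, z).shape)  # torch.Size([4])
```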