Foundation Policies with Hilbert Representations

Unsupervised and self-supervised objectives, such as next token prediction, have enabled pre-training generalist models from large amounts of unlabeled data. In reinforcement learning (RL), however, finding a truly general and scalable unsupervised pre-training objective for generalist policies from offline data remains a major open question. A number of methods have been proposed to enable generic self-supervised RL, based on principles such as goal-conditioned RL, behavioral cloning, and unsupervised skill learning. However, such methods remain limited by the diversity of the discovered behaviors, the need for high-quality demonstration data, or the lack of a clear adaptation mechanism for downstream tasks. In this work, we propose a novel unsupervised framework to pre-train generalist policies that capture diverse, optimal, long-horizon behaviors from unlabeled offline data, such that they can be quickly adapted to arbitrary new tasks in a zero-shot manner. Our key insight is to learn a structured representation that preserves the temporal structure of the underlying environment, and then to span this learned latent space with directional movements, which enables various zero-shot policy "prompting" schemes for downstream tasks. Through our experiments on simulated robotic locomotion and manipulation benchmarks, we show that our unsupervised policies can solve goal-conditioned and general RL tasks in a zero-shot fashion, often even outperforming prior methods designed specifically for each setting. Our code and videos are available at https://seohong.me/projects/hilp/.
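
The idea of spanning the latent space with directional movements, and then "prompting" the policy with a direction, can be illustrated with a small sketch. It assumes a representation phi whose Euclidean distances reflect temporal distances between states, a unit-norm latent direction z, and an inner-product form for the directional reward. The function names, the identity encoder, and the toy states are hypothetical and are not taken from the abstract.

```python
import numpy as np


def directional_reward(phi_s, phi_next, z):
    """Illustrative reward: how far a transition moves along latent direction z."""
    return float(np.dot(phi_next - phi_s, z))


def goal_prompt(phi_s, phi_g, eps=1e-8):
    """Illustrative goal prompt: unit latent direction from the state toward the goal."""
    diff = phi_g - phi_s
    return diff / (np.linalg.norm(diff) + eps)


def phi(state):
    """Toy stand-in for a learned encoder (assumption): the identity on 2-D states."""
    return np.asarray(state, dtype=float)


s, s_next, g = [0.0, 0.0], [1.0, 0.0], [3.0, 0.0]

# Prompt the (hypothetical) latent-conditioned policy with the direction toward g.
z = goal_prompt(phi(s), phi(g))

# A step toward the goal earns positive reward; a step away earns negative reward.
r_toward = directional_reward(phi(s), phi(s_next), z)
r_away = directional_reward(phi(s_next), phi(s), z)
```

In this toy setting, a policy conditioned on z that maximizes the directional reward moves toward the goal in latent space, which is one way to picture zero-shot goal reaching without any additional training.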
