Zero-shot reinforcement learning (RL) algorithms aim to learn a family of policies from a reward-free dataset and to recover optimal policies for any reward function directly at test time. Naturally, the quality of the pretraining dataset determines the performance of the recovered policies across tasks. However, pre-collecting a relevant, diverse dataset without prior knowledge of the downstream tasks of interest remains a challenge. In this work, we study $\textit{online}$ zero-shot RL for quadrupedal control on real robotic systems, building upon the Forward-Backward (FB) algorithm. We observe that undirected exploration yields low-diversity data, leading to poor downstream performance and rendering policies impractical for direct hardware deployment. Therefore, we introduce FB-MEBE, an online zero-shot RL algorithm that combines an unsupervised behavior exploration strategy with a regularization critic. FB-MEBE promotes exploration by maximizing the entropy of the achieved behavior distribution. Additionally, a regularization critic shapes the recovered policies toward more natural and physically plausible behaviors. We empirically demonstrate that FB-MEBE achieves improved performance compared to other exploration strategies in a range of simulated downstream tasks, and that it yields natural policies that can be deployed to hardware without further finetuning. Videos and code are available on our website.