Large, general-purpose robotic policies trained on diverse demonstration datasets have been shown to be remarkably effective both for controlling a variety of robots in a range of different scenes, and for acquiring broad repertoires of manipulation skills. However, the data that such policies are trained on is generally of mixed quality -- not only are human-collected demonstrations unlikely to perform the task perfectly, but the larger the dataset is, the harder it is to curate only the highest quality examples. It also remains unclear how well data collected on one embodiment transfers to training policies for another embodiment. In this paper, we present a general and broadly applicable approach that enhances the performance of such generalist robot policies at deployment time by re-ranking their actions according to a value function learned via offline RL. This approach, which we call Value-Guided Policy Steering (V-GPS), is compatible with a wide range of different generalist policies, without needing to fine-tune or even access the weights of the policy. We show that the same value function can improve the performance of five different state-of-the-art policies with different architectures, even though they were trained on distinct datasets, attaining consistent performance improvement on multiple robotic platforms across a total of 12 tasks. Code and videos can be found at: https://nakamotoo.github.io/V-GPS
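The core deployment-time loop described above (sample candidate actions from a frozen generalist policy, re-rank them with a Q-function learned via offline RL, execute the top-ranked action) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the policy, the critic, the 7-dimensional action space, and all function names here are assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def generalist_policy(obs, num_samples):
    """Stand-in for any pretrained generalist policy, treated as a black
    box: draw K candidate actions for the current observation. In V-GPS
    the policy's weights are never touched or even accessed."""
    return rng.normal(size=(num_samples, 7))  # e.g. 7-DoF arm actions

def q_value(obs, actions):
    """Stand-in for a value function learned with offline RL; assigns a
    scalar score to each candidate action. Here: a fixed linear critic,
    purely for illustration."""
    weights = np.linspace(-1.0, 1.0, actions.shape[1])
    return actions @ weights

def steer(obs, num_samples=32):
    """Value-guided steering: sample candidates from the frozen policy,
    re-rank by estimated value, and return the highest-scoring action."""
    candidates = generalist_policy(obs, num_samples)
    scores = q_value(obs, candidates)
    return candidates[np.argmax(scores)]

obs = np.zeros(10)  # placeholder observation
action = steer(obs)
```

Because the policy is only queried for samples, the same learned value function can be reused across different policies and architectures, which is what the abstract's cross-policy results rely on.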