Rewards remain an uninterpretable way to specify tasks for reinforcement learning (RL): humans often cannot predict the optimal behavior induced by a given reward function, which leads to poor reward design and reward hacking. Language is an appealing way to communicate intent to agents and bypass reward design, but prior efforts have been limited by costly, unscalable labeling. In this work, we propose a completely unsupervised alternative that grounds language instructions into policies in a zero-shot manner. Our solution takes the form of imagine, project, and imitate: the agent imagines the observation sequence corresponding to the language description of a task, projects the imagined sequence onto the target domain, and grounds it to a policy. Video-language models let us imagine the observation sequence for a task description by leveraging knowledge of tasks learned from internet-scale video-text mappings. The challenge that remains is grounding these generations to a policy. We show that a zero-shot language-to-behavior policy can be obtained by first grounding the imagined sequences in real observations of an unsupervised RL agent and then using a closed-form solution to imitation learning that lets the RL agent mimic the grounded observations. To our knowledge, our method, RLZero, is the first to show zero-shot language-to-behavior generation without any supervision on a variety of tasks in simulated domains. We further show that RLZero can also generate policies zero-shot from cross-embodied videos, such as those scraped from YouTube.
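To make the project and imitate steps concrete, the following is a minimal sketch and not the RLZero implementation. It rests on assumptions the abstract does not state: that imagined frames and real agent observations share an embedding space, that projection is a cosine nearest-neighbor lookup, and that the closed-form imitation step averages a per-state feature vector over the grounded observations to obtain a task vector. The names `project`, `imitate`, `bank`, and `behavior_features` are hypothetical, and random arrays stand in for the video-language model output and the agent's data.

```python
import numpy as np


def project(imagined, bank):
    """For each imagined frame embedding, return the index of the most
    cosine-similar embedding in the bank of real agent observations."""
    a = imagined / np.linalg.norm(imagined, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    return np.argmax(a @ b.T, axis=1)


def imitate(indices, behavior_features):
    """Assumed closed-form step: average a per-state feature vector over
    the grounded observations and rescale it to unit norm."""
    z = behavior_features[indices].mean(axis=0)
    return z / np.linalg.norm(z)


rng = np.random.default_rng(0)

# Stand-in for embeddings of observations collected by an unsupervised agent.
bank = rng.normal(size=(500, 16))

# Stand-in for imagined frames: slightly perturbed copies of three bank rows.
target = np.array([10, 20, 30])
imagined = bank[target] + 0.01 * rng.normal(size=(3, 16))

# Stand-in for per-state features that a pretrained agent would provide.
behavior_features = rng.normal(size=(500, 8))

idx = project(imagined, bank)          # imagined frames -> real observations
z = imitate(idx, behavior_features)    # grounded observations -> task vector
```

In this toy setup `idx` recovers the rows the imagined frames were derived from, and `z` is the vector a pretrained policy would be conditioned on under the stated assumptions. No reward function or labeled data appears anywhere in the sketch.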