We propose a framework that leverages foundation models as teachers, guiding a reinforcement learning agent to acquire semantically meaningful behavior without human feedback. In our framework, the agent receives task instructions grounded in a training environment from large language models. Then, a vision-language model guides the agent in learning the multi-task language-conditioned policy by providing reward feedback. We demonstrate that our method can learn semantically meaningful skills in a challenging open-ended MineDojo environment while prior unsupervised skill discovery methods struggle. Additionally, we discuss observed challenges of using off-the-shelf foundation models as teachers and our efforts to address them.
翻译:摘要:我们提出一种框架,利用基础模型作为教师,引导强化学习代理无需人类反馈即可获得具有语义意义的行为。在该框架中,代理从大语言模型接收基于训练环境的任务指令;随后,视觉-语言模型通过提供奖励反馈,指导代理学习多任务语言条件化策略。我们证明,该方法能在具有挑战性的开放型MineDojo环境中习得具有语义意义的技能,而此前无监督技能发现方法在此场景中表现困难。此外,本文讨论了使用现成基础模型作为教师时观察到的挑战及应对措施。