Leveraging large language models (LLMs) to improve decision-making has recently attracted considerable attention. However, aligning the natural-language instructions that LLMs generate with the vectorized operations required for execution remains a significant challenge and often requires task-specific details. To avoid this dependence on task-specific detail, and inspired by preference-based policy learning, we investigate using multimodal LLMs to provide automated preference feedback from image inputs alone to guide decision-making. We train a multimodal LLM, termed CriticGPT, that understands trajectory videos of robot manipulation tasks and serves as a critic, offering analysis and preference feedback. We then validate the preference labels generated by CriticGPT from a reward-modeling perspective. Experimental evaluation of its preference accuracy shows that it generalizes effectively to new tasks. Furthermore, results on Meta-World tasks show that CriticGPT's reward model efficiently guides policy learning, surpassing rewards based on state-of-the-art pre-trained representation models.
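To make the reward-modeling step concrete, the following is a minimal sketch of how preference labels over trajectory pairs can be turned into a reward model. It assumes a Bradley-Terry preference model and a linear per-step reward, both common choices in preference-based policy learning; the abstract does not state which loss or architecture is used. All names (`segment_return`, `preference_prob`, `train_reward_model`) and the synthetic data are hypothetical, and the labels here come from a hidden ground-truth reward standing in for the critic's output.

```python
import numpy as np

def segment_return(w, segment):
    """Sum of predicted per-step rewards r(s) = w . s over a segment of shape (T, d)."""
    return float(np.sum(segment @ w))

def preference_prob(w, seg_a, seg_b):
    """Bradley-Terry probability that seg_a is preferred over seg_b."""
    diff = segment_return(w, seg_a) - segment_return(w, seg_b)
    # Numerically stable sigmoid: sigmoid(x) = 0.5 * (1 + tanh(x / 2)).
    return 0.5 * (1.0 + np.tanh(0.5 * diff))

def train_reward_model(pairs, labels, dim, lr=0.1, epochs=200):
    """Fit w by gradient descent on the cross-entropy between predicted and given preferences."""
    w = np.zeros(dim)
    for _ in range(epochs):
        grad = np.zeros(dim)
        for (seg_a, seg_b), y in zip(pairs, labels):
            p = preference_prob(w, seg_a, seg_b)
            # d(loss)/dw = (p - y) * (sum of features in a - sum of features in b)
            grad += (p - y) * (seg_a.sum(axis=0) - seg_b.sum(axis=0))
        w -= lr * grad / len(pairs)
    return w

# Synthetic stand-in for critic-labeled data (assumption: 3-dim features, 10-step segments).
rng = np.random.default_rng(0)
true_w = np.array([1.0, -2.0, 0.5])  # hidden reward used only to generate labels
pairs = [(rng.normal(size=(10, 3)), rng.normal(size=(10, 3))) for _ in range(200)]
labels = [
    1.0 if segment_return(true_w, a) > segment_return(true_w, b) else 0.0
    for a, b in pairs
]

w = train_reward_model(pairs, labels, dim=3)

# Preference accuracy: fraction of pairs where the learned reward agrees with the label.
accuracy = np.mean(
    [(preference_prob(w, a, b) > 0.5) == (y == 1.0) for (a, b), y in zip(pairs, labels)]
)
print("preference accuracy:", accuracy)
```

In a setup like the one described, the labels would come from the multimodal critic rather than a known reward, and the learned reward would then be passed to a standard policy-learning algorithm.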