Recently, multimodal large language models (MLLMs) have demonstrated strong visual understanding and decision-making capabilities, making it possible to explore the autonomous improvement of MLLMs in unknown environments. However, external feedback, such as human or environmental feedback, is not always available. To address this challenge, existing methods focus primarily on enhancing the decision-making ability of MLLMs through voting and scoring mechanisms, while little attention has been paid to improving their comprehension of unknown environments. To fully unleash the self-learning potential of MLLMs, we propose SELU, a novel self-learning paradigm inspired by the actor-critic framework in reinforcement learning. The critic employs self-asking and hindsight relabeling to extract knowledge from the interaction trajectories collected by the actor, thereby augmenting its environmental comprehension. Meanwhile, the actor is improved by the self-feedback provided by the critic, enhancing its decision-making. We evaluate our method in the AI2-THOR and VirtualHome environments, where SELU achieves critic improvements of approximately 28% and 30%, and actor improvements of about 20% and 24%, through self-learning.
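The interplay between the two roles described above can be illustrated with a minimal toy sketch. All names, the toy environment, and the data shapes below are illustrative assumptions, not the paper's implementation (which fine-tunes MLLMs on visual observations in AI2-THOR and VirtualHome); the sketch only shows how hindsight relabeling and self-asking turn raw actor trajectories into critic training data, and how critic feedback filters trajectories for the actor.

```python
# Hypothetical toy environment: what an instruction actually leads to.
ENV = {"slice the apple": "pick up the apple",
       "turn on the lamp": "turn on the lamp"}

def actor_rollout(goal):
    """Actor interacts with the (toy) environment and records the outcome."""
    return {"goal": goal, "achieved": ENV.get(goal, "nothing")}

def hindsight_relabel(traj):
    """Critic: relabel the trajectory with the outcome actually achieved,
    turning even a failed rollout into a correct (instruction, outcome) pair."""
    return {"goal": traj["achieved"], "achieved": traj["achieved"]}

def self_ask(traj):
    """Critic: pose a question about the environment that the trajectory answers."""
    return (f"What does '{traj['goal']}' lead to?", traj["achieved"])

def critic_feedback(traj):
    """Critic scores the actor's rollout: 1.0 means the stated goal was met."""
    return 1.0 if traj["achieved"] == traj["goal"] else 0.0

goals = ["slice the apple", "turn on the lamp"]
trajectories = [actor_rollout(g) for g in goals]

# Critic self-learning data: relabeled trajectories plus self-asked QA pairs
# (in the paper these would be fine-tuning samples for the critic MLLM).
critic_data = [hindsight_relabel(t) for t in trajectories]
qa_pairs = [self_ask(t) for t in trajectories]

# Actor self-learning data: only the rollouts the critic judged successful.
actor_data = [t for t in trajectories if critic_feedback(t) == 1.0]
```

The key point the sketch captures is that the failed rollout ("slice the apple" ending in "pick up the apple") is not discarded: hindsight relabeling converts it into valid environmental knowledge for the critic, while the critic's feedback keeps only the successful rollout for improving the actor.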