Using large language models (LLMs) to compose off-the-shelf visual tools is a promising direction for building robust visual assistants that can handle diverse visual tasks. However, existing methods typically freeze the tools they use and so overlook continual learning, which limits their adaptation to environments that require new knowledge. To address this, we propose CLOVA, a Closed-Loop Visual Assistant that operates in a framework with inference, reflection, and learning phases. In the inference phase, LLMs generate programs and execute the corresponding tools to complete assigned tasks. In the reflection phase, a multimodal global-local reflection scheme analyzes human feedback to determine which tools need updating. Finally, the learning phase uses three flexible approaches to automatically collect training data and introduces a novel prompt tuning scheme to update the tools, allowing CLOVA to acquire new knowledge efficiently. Experiments show that CLOVA surpasses existing tool-usage methods by 5% in visual question answering and multiple-image reasoning, by 10% in knowledge tagging, and by 20% in image editing. These results underscore the importance of continual learning in general visual assistants.
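The three phases can be pictured as a small control loop. The following is only a minimal sketch under stated assumptions, not CLOVA's actual implementation: the names `Tool`, `infer`, `reflect`, `learn`, and `closed_loop_step` are hypothetical, the program is hard-coded where an LLM would generate it, human feedback and step-level checks are plain callbacks, and "learning" merely appends collected examples to a list that stands in for tuned prompts.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Tool:
    """A visual tool plus a list standing in for its learned prompts (assumption)."""
    name: str
    run: Callable[[str], str]
    learned_prompts: List[str] = field(default_factory=list)


Trace = List[Tuple[str, str]]  # (tool name, intermediate output) for each step


def infer(program: List[str], tools: Dict[str, Tool], task_input: str) -> Trace:
    """Inference: execute the tools named in the program, in order."""
    trace: Trace = []
    value = task_input
    for name in program:
        value = tools[name].run(value)
        trace.append((name, value))
    return trace


def reflect(trace: Trace, step_ok: Callable[[str, str], bool]) -> Optional[str]:
    """Reflection: return the first tool whose step is judged faulty, if any."""
    for name, output in trace:
        if not step_ok(name, output):
            return name
    return None


def learn(tool: Tool, training_data: List[str]) -> None:
    """Learning: update only the faulty tool (here, by extending a list)."""
    tool.learned_prompts.extend(training_data)


def closed_loop_step(program, tools, task_input, answer_ok, step_ok, collect_data):
    """Run one pass of the loop; return the name of the updated tool, or None."""
    trace = infer(program, tools, task_input)
    if answer_ok(trace[-1][1]):  # feedback says the final answer is fine
        return None
    faulty = reflect(trace, step_ok)
    if faulty is not None:
        learn(tools[faulty], collect_data(faulty))
    return faulty


# Toy demo: the "classify" tool does not know the requested concept yet.
tools = {
    "locate": Tool("locate", lambda x: f"region({x})"),
    "classify": Tool("classify", lambda x: "unknown"),
}
updated = closed_loop_step(
    program=["locate", "classify"],  # an LLM would generate this program
    tools=tools,
    task_input="image.jpg",
    answer_ok=lambda answer: answer != "unknown",       # stand-in for human feedback
    step_ok=lambda name, output: output != "unknown",   # stand-in for step-level checks
    collect_data=lambda name: [f"example for {name}"],  # stand-in for data gathering
)
print(updated, tools["classify"].learned_prompts)
```

In this toy run only the tool flagged during reflection is updated while the others stay untouched, which mirrors the idea of determining which tools require updating before any learning takes place.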