CLOVA: A Closed-Loop Visual Assistant with Tool Usage and Update

Utilizing large language models (LLMs) to compose off-the-shelf visual tools represents a promising avenue of research for developing robust visual assistants capable of addressing diverse visual tasks. However, these methods often overlook the potential for continual learning, typically by freezing the utilized tools, thus limiting their adaptation to environments requiring new knowledge. To tackle this challenge, we propose CLOVA, a Closed-Loop Visual Assistant, which operates within a framework encompassing inference, reflection, and learning phases. During the inference phase, LLMs generate programs and execute corresponding tools to complete assigned tasks. In the reflection phase, a multimodal global-local reflection scheme analyzes human feedback to determine which tools require updating. Lastly, the learning phase employs three flexible approaches to automatically gather training data and introduces a novel prompt tuning scheme to update the tools, allowing CLOVA to efficiently acquire new knowledge. Experimental findings demonstrate that CLOVA surpasses existing tool-usage methods by 5% in visual question answering and multiple-image reasoning, by 10% in knowledge tagging, and by 20% in image editing. These results underscore the significance of the continual learning capability in general visual assistants.
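The three phases described above can be illustrated with a minimal toy sketch. This is not the authors' implementation. Every name here (`Tool`, `inference`, `reflect`, `closed_loop`) is hypothetical, the tools are plain lookup tables rather than visual models, the "program" is a fixed list of tool names rather than LLM-generated code, reflection is a simple heuristic rather than the multimodal global-local scheme, and a dictionary update stands in for prompt tuning.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

UNKNOWN = "unknown"  # sentinel returned when a tool lacks the needed knowledge


@dataclass
class Tool:
    """Toy stand-in for a visual tool: a lookup table instead of a real model."""
    name: str
    knowledge: Dict[str, str] = field(default_factory=dict)

    def run(self, query: str) -> str:
        return self.knowledge.get(query, UNKNOWN)

    def update(self, examples: List[Tuple[str, str]]) -> None:
        # Assumption: a dict update replaces the paper's prompt tuning step.
        self.knowledge.update(examples)


Trace = List[Tuple[str, str, str]]  # (tool name, input, output) per step


def inference(program: List[str], tools: Dict[str, Tool],
              task_input: str) -> Tuple[str, Trace]:
    """Inference phase: execute the program's tools in order, keeping a trace."""
    trace: Trace = []
    value = task_input
    for tool_name in program:
        output = tools[tool_name].run(value)
        trace.append((tool_name, value, output))
        value = output
    return value, trace


def reflect(trace: Trace) -> Optional[str]:
    """Reflection phase (toy heuristic): blame the first tool that returned UNKNOWN."""
    for tool_name, _, output in trace:
        if output == UNKNOWN:
            return tool_name
    return None


def closed_loop(program: List[str], tools: Dict[str, Tool], task_input: str,
                expected: str,
                training_data: Dict[str, List[Tuple[str, str]]]) -> str:
    """Run inference; on a wrong answer, reflect, update one tool, and retry once."""
    answer, trace = inference(program, tools, task_input)
    if answer == expected:  # human feedback says the answer is correct
        return answer
    faulty = reflect(trace)
    if faulty is not None:
        # Learning phase: update only the tool identified by reflection.
        tools[faulty].update(training_data.get(faulty, []))
        answer, _ = inference(program, tools, task_input)
    return answer


# Hypothetical task: detect an object, then tag it with a fine-grained name.
tools = {"detect": Tool("detect", {"image_1": "bird"}), "tag": Tool("tag")}
program = ["detect", "tag"]

before, _ = inference(program, tools, "image_1")
after = closed_loop(program, tools, "image_1", expected="puffin",
                    training_data={"tag": [("bird", "puffin")]})
print(before, "->", after)
```

In this sketch only the tool flagged by reflection is updated; the `detect` tool is left untouched. That mirrors the idea in the abstract of deciding which tools require updating instead of retraining everything.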

