This paper presents Bag-of-Concept Graph (BACON), which grants models with limited linguistic ability access to the capabilities of Vision Language Models (VLMs) and boosts downstream tasks such as detection, visual question answering (VQA), and image generation. Because visual scenes in the physical world are structured by complex relations between objects, BACON decomposes annotations into minimal basic elements and presents them in a graph structure. The element-wise style makes annotations easy to understand, and the structural composition makes precise localization tractable. Careful prompt design, combined with publicly available VLMs and segmentation methods, produces the BACON captions. In this way, we gather a dataset of 100K annotated images, which endows VLMs with remarkable capabilities, such as accurately generating BACON, transforming prompts into the BACON format, envisioning scenarios in the BACON style, and dynamically modifying elements within BACON through interactive dialogue, among others. Extensive representative experiments on detection, VQA, and image generation tasks show that BACON either enables previously out-of-reach tasks or surpasses current cutting-edge solutions.
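To make the "objects as minimal elements, relations as structure" idea concrete, the sketch below shows one way such an annotation could be represented as a graph, with objects as attributed nodes and pairwise relations as labeled edges. The class names, fields, and example scene are illustrative assumptions, not the paper's actual BACON schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

@dataclass
class ConceptNode:
    """One minimal annotation element (hypothetical schema, not the paper's format)."""
    name: str                                  # object category, e.g. "dog"
    attributes: List[str] = field(default_factory=list)   # e.g. ["small", "brown"]
    bbox: Optional[Tuple[float, float, float, float]] = None  # normalized (x, y, w, h)

@dataclass
class ConceptGraph:
    """Objects as nodes, pairwise relations as labeled directed edges."""
    nodes: Dict[str, ConceptNode] = field(default_factory=dict)
    edges: List[Tuple[str, str, str]] = field(default_factory=list)  # (subject, relation, object)

    def add_object(self, key: str, node: ConceptNode) -> None:
        self.nodes[key] = node

    def relate(self, subject: str, relation: str, obj: str) -> None:
        self.edges.append((subject, relation, obj))

# Build a tiny graph for "a small brown dog sitting on a red sofa".
g = ConceptGraph()
g.add_object("dog_1", ConceptNode("dog", ["small", "brown"]))
g.add_object("sofa_1", ConceptNode("sofa", ["red"]))
g.relate("dog_1", "sitting on", "sofa_1")
```

Keeping each element atomic (one node per object, one edge per relation) is what allows the pieces of a caption to be located, grounded, and edited independently.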