Advances in large Vision-Language Models (VLMs) have enabled precise, detailed image captioning, which is vital for advancing multi-modal image understanding and processing. Yet these captions often carry lengthy, intertwined context that is difficult to parse and frequently omits essential cues, posing a significant barrier for models such as GroundingDINO and SDXL, which lack the strong text encoding and syntactic analysis needed to fully leverage dense captions. To address this, we propose BACON, a prompting method that breaks VLM-generated captions into disentangled, structured elements such as objects, relationships, styles, and themes. This approach not only minimizes confusion when handling complex contexts but also allows efficient conversion into a JSON dictionary, enabling models without linguistic processing capabilities to easily access key information. We annotated 100,000 image-caption pairs with BACON using GPT-4V and trained a LLaVA captioner on this dataset, enabling it to produce BACON-style captions without relying on costly GPT-4V. Evaluations of overall quality, precision, and recall, as well as user studies, demonstrate that the resulting caption model consistently outperforms other SOTA VLMs in generating high-quality captions. Moreover, we show that BACON-style captions exhibit better clarity when applied to various models, enabling them to accomplish previously unattainable tasks or surpass existing SOTA solutions without training. For example, BACON-style captions help GroundingDINO achieve 1.51x higher recall on open-vocabulary object detection than leading methods.
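To illustrate the idea of a disentangled, JSON-accessible caption, the sketch below constructs a hypothetical BACON-style dictionary. The field names (`objects`, `relationships`, `style`, `theme`) mirror the element categories named above, but the exact schema is an assumption for illustration, not the paper's specification.

```python
import json

# Hypothetical BACON-style structured caption; field names and values
# are illustrative assumptions, not the paper's exact schema.
bacon_caption = {
    "theme": "a quiet morning in a park",
    "style": "photorealistic, soft lighting",
    "objects": [
        {"name": "dog", "attributes": ["brown", "small"]},
        {"name": "bench", "attributes": ["wooden"]},
    ],
    "relationships": [
        {"subject": "dog", "predicate": "sitting on", "object": "bench"},
    ],
}

# A downstream model without linguistic processing can read keys
# directly; e.g. an open-vocabulary detector could take just the
# object names as its prompt vocabulary.
object_names = [obj["name"] for obj in bacon_caption["objects"]]

print(json.dumps(bacon_caption, indent=2))
print(object_names)  # ['dog', 'bench']
```

Because the caption is a plain dictionary rather than free-form prose, each consumer model extracts only the fields it needs, avoiding the parsing burden that dense captions impose.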