Enhancing Large Vision Language Models with Self-Training on Image Comprehension

Large vision language models (LVLMs) integrate large language models (LLMs) with pre-trained vision encoders, thereby activating the model's perception capability so that it can understand image inputs for different queries and conduct subsequent reasoning. Improving this capability requires high-quality vision-language data, which is costly and labor-intensive to acquire. Self-training approaches have been effective in single-modal settings, alleviating the need for labeled data by leveraging the model's own generations. However, effective self-training remains challenging for LVLMs because of their unique visual perception and reasoning capabilities. To address this, we introduce Self-Training on Image Comprehension (STIC), a self-training approach designed specifically for image comprehension. First, the model self-constructs a preference dataset for image descriptions using unlabeled images. Preferred responses are generated through a step-by-step prompt, while dispreferred responses are generated from either corrupted images or misleading prompts. To further self-improve reasoning on the extracted visual information, we let the model reuse a small portion of existing instruction-tuning data and append its self-generated image descriptions to the prompts. We validate the effectiveness of STIC across seven different benchmarks, demonstrating substantial performance gains of 4.0% on average while using 70% less supervised fine-tuning data than the current method. Further studies investigate various components of STIC and highlight its potential to leverage vast quantities of unlabeled images for self-training. Code and data are made publicly available.
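The two data-construction stages described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the prompt strings, the `describe` and `corrupt` callables, the `PreferencePair` container, the even split between corrupted images and misleading prompts, and the format of the description-augmented prompt are all assumptions made here for illustration.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, List

# Hypothetical prompts: the abstract does not give the actual wording.
STEP_BY_STEP_PROMPT = "Describe the image step by step: objects, attributes, relations, then a summary."
MISLEADING_PROMPT = "Describe the image, including objects that are plausible but not visible."
PLAIN_PROMPT = "Describe the image."


@dataclass
class PreferencePair:
    image: Any
    prompt: str
    preferred: str
    dispreferred: str


def build_preference_dataset(
    images: List[Any],
    describe: Callable[[Any, str], str],  # assumed wrapper around the LVLM's generation
    corrupt: Callable[[Any], Any],        # assumed image corruption (e.g. blur, noise)
    rng: random.Random,
) -> List[PreferencePair]:
    """Stage 1: self-construct preference pairs from unlabeled images."""
    pairs = []
    for image in images:
        # Preferred response: description elicited by a step-by-step prompt.
        preferred = describe(image, STEP_BY_STEP_PROMPT)
        # Dispreferred response: corrupted image OR misleading prompt.
        # The 50/50 choice is an assumption, not a detail from the abstract.
        if rng.random() < 0.5:
            dispreferred = describe(corrupt(image), PLAIN_PROMPT)
        else:
            dispreferred = describe(image, MISLEADING_PROMPT)
        pairs.append(PreferencePair(image, PLAIN_PROMPT, preferred, dispreferred))
    return pairs


def infuse_description(image: Any, question: str, describe: Callable[[Any, str], str]) -> str:
    """Stage 2: append a self-generated description to an instruction-tuning prompt."""
    description = describe(image, PLAIN_PROMPT)
    # The template below is an assumed format for the augmented prompt.
    return f"Image description: {description}\n{question}"
```

In this sketch the pairs from `build_preference_dataset` would feed a preference-optimization step, and `infuse_description` would be applied to the small reused portion of instruction-tuning data before fine-tuning; both training steps are outside the scope of the example.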

