While visual language model (VLM) architectures and training infrastructures advance rapidly, data curation remains under-explored, and data quantity and quality are becoming a bottleneck. Existing work either crawls additional Internet data with only loose quality guarantees or distills from black-box proprietary models, e.g., GPT-4V / Gemini, which are bounded by API rate limits and by those models' own performance. This work enables a VLM to improve itself via data enhancement, exploiting its generative nature. We introduce a simple yet effective VLM augmentation scheme that includes a self-augment step and a specialist-augment step to iteratively improve data quality and, hence, model performance. In the self-augment step, the instruction-finetuned VLM recaptions its pretraining caption datasets and is then retrained from scratch on the refined data. Without any expensive human-in-the-loop annotation, we observe improvements in data quality and boosts in downstream accuracy over three self-augmentation rounds -- a viable free lunch for the current VLM training recipe. When self-augmentation saturates, we increase caption diversity by exploiting specialty skills acquired during instruction finetuning. We finetune VLM specialists from the self-augmented VLM on domain-specific data (spatial, grounding, and OCR) and use them to fuse task-aware synthetic data into the pretraining stage. Improvements in data quality and reductions in hallucination are cross-checked by VLM judges (GPT-4V, Gemini) and human judges. Combining self-augmentation and specialist-augmented training, VILA$^2$ consistently improves accuracy over the prior art on a wide range of benchmarks, producing a reusable pretraining dataset that is 300x more cost-efficient than human labeling.
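The two-stage recipe can be summarized in a few lines of pseudocode. The sketch below is a minimal illustration only: the helpers `pretrain`, `instruction_finetune`, and `finetune_specialist` are hypothetical stand-ins for full training runs, not a released API, and the caption-merging strategy shown is one plausible choice rather than the paper's exact procedure.

```python
"""Minimal sketch of the VILA^2 augmentation loop under the stated assumptions."""

from typing import Callable, List, Sequence, Tuple

Captioner = Callable[[str], str]  # image path -> generated caption

def pretrain(images: Sequence[str], captions: Sequence[str]) -> Captioner:
    """Placeholder: train a VLM from scratch on (image, caption) pairs."""
    return lambda image: f"caption({image})"

def instruction_finetune(model: Captioner) -> Captioner:
    """Placeholder: SFT stage that unlocks detailed recaptioning skills."""
    return model

def finetune_specialist(model: Captioner, skill: str) -> Captioner:
    """Placeholder: SFT on domain-specific data (spatial / grounding / OCR)."""
    return lambda image: f"{skill}-aware caption({image})"

def self_augment(images: List[str], captions: List[str],
                 rounds: int = 3) -> Tuple[Captioner, List[str]]:
    """Step 1: recaption the pretraining set with the latest VLM, retrain from scratch."""
    model = instruction_finetune(pretrain(images, captions))
    for _ in range(rounds):
        captions = [model(image) for image in images]       # refined captions
        model = instruction_finetune(pretrain(images, captions))
    return model, captions

def specialist_augment(model: Captioner, images: List[str],
                       captions: List[str],
                       skills: Sequence[str] = ("spatial", "grounding", "ocr")) -> Captioner:
    """Step 2: fuse task-aware synthetic captions from finetuned specialists."""
    for skill in skills:
        specialist = finetune_specialist(model, skill)
        captions = [cap + " " + specialist(image)           # enrich, don't replace
                    for image, cap in zip(images, captions)]
    return pretrain(images, captions)                       # final retraining pass
```

The key design point the sketch captures is that each round retrains from scratch on the refined captions rather than continuing training, and that specialist captions are added on top of the self-augmented ones once self-augmentation saturates.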