While visual language model (VLM) architectures and training infrastructures advance rapidly, data curation remains under-explored, and data quantity and quality are becoming a bottleneck. Existing work either crawls additional Internet data with only loose quality guarantees or distills from black-box proprietary models, e.g., GPT-4V / Gemini, which are bounded by API rate limits and by those models' own performance. This work enables a VLM to improve itself via data enhancement, exploiting its generative nature. We introduce a simple yet effective VLM augmentation scheme that includes a self-augment step and a specialist-augment step to iteratively improve data quality and, hence, model performance. In the self-augment step, the instruction-finetuned VLM recaptions its pretraining caption datasets and is then retrained from scratch on the refined data. Without any expensive human-in-the-loop annotation, we observe improvements in data quality and boosts in downstream accuracy over three self-augmentation rounds -- a viable free lunch for the current VLM training recipe. When self-augmentation saturates, we increase caption diversity by exploiting specialty skills acquired during instruction finetuning. We finetune VLM specialists from the self-augmented VLM on domain-specific data (spatial, grounding, and OCR) and use them to fuse task-aware synthetic data into the pretraining stage. Improvements in data quality and reductions in hallucination are cross-checked by VLM judges (GPT-4V, Gemini) and human judges. Combining self-augmentation and specialist-augmented training, VILA$^2$ consistently improves accuracy over the prior art on a wide range of benchmarks, producing a reusable pretraining dataset that is 300x more cost-efficient than human labeling.
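The two-stage recipe can be summarized in a few lines of pseudocode. The sketch below is a minimal illustration only: the helpers `pretrain`, `instruction_finetune`, and `finetune_specialist` are hypothetical stand-ins for full training runs, not a released API, and the caption-merging strategy shown is one plausible choice rather than the paper's exact procedure.

```python
"""Minimal sketch of the VILA^2 augmentation loop under the stated assumptions."""

from typing import Callable, List, Sequence, Tuple

Captioner = Callable[[str], str]  # image path -> generated caption

def pretrain(images: Sequence[str], captions: Sequence[str]) -> Captioner:
    """Placeholder: train a VLM from scratch on (image, caption) pairs."""
    return lambda image: f"caption({image})"

def instruction_finetune(model: Captioner) -> Captioner:
    """Placeholder: SFT stage that unlocks detailed recaptioning skills."""
    return model

def finetune_specialist(model: Captioner, skill: str) -> Captioner:
    """Placeholder: SFT on domain-specific data (spatial / grounding / OCR)."""
    return lambda image: f"{skill}-aware caption({image})"

def self_augment(images: List[str], captions: List[str],
                 rounds: int = 3) -> Tuple[Captioner, List[str]]:
    """Step 1: recaption the pretraining set with the latest VLM, retrain from scratch."""
    model = instruction_finetune(pretrain(images, captions))
    for _ in range(rounds):
        captions = [model(image) for image in images]       # refined captions
        model = instruction_finetune(pretrain(images, captions))
    return model, captions

def specialist_augment(model: Captioner, images: List[str],
                       captions: List[str],
                       skills: Sequence[str] = ("spatial", "grounding", "ocr")) -> Captioner:
    """Step 2: fuse task-aware synthetic captions from finetuned specialists."""
    for skill in skills:
        specialist = finetune_specialist(model, skill)
        captions = [cap + " " + specialist(image)           # enrich, don't replace
                    for image, cap in zip(images, captions)]
    return pretrain(images, captions)                       # final retraining pass
```

The key design point the sketch captures is that each round retrains from scratch on the refined captions rather than continuing training, and that specialist captions are added on top of the self-augmented ones once self-augmentation saturates.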