Vision-Language Transformers can be learned without low-level human labels (e.g., class labels or bounding boxes). Existing work, whether it explicitly uses bounding boxes or patches, assumes that the visual backbone must first be trained on ImageNet class prediction before being integrated into a multimodal linguistic pipeline. We show that this is not necessary and introduce a new model, Vision-Language from Captions (VLC), built on top of Masked Auto-Encoders, that does not require this supervision. In a head-to-head comparison between our model, VLC, and ViLT, the current state-of-the-art patch-based vision-language transformer, which is pretrained with supervised object classification, we find that our approach (1) outperforms ViLT on standard benchmarks, (2) provides more interpretable and intuitive patch visualizations, and (3) is competitive with many larger models that utilize ROIs trained on annotated bounding boxes.