In the past several years there has been an explosion of available models for vision-language tasks. Unfortunately, the literature still leaves open a number of questions related to best practices in designing and training such models. In this paper we seek to answer several questions related to the pretraining of vision-language encoders through meta-analysis. In our first set of experiments, we show that we can save significant compute at no cost to downstream performance by freezing large parts of vision-language models during pretraining. In our second set of experiments, we examine the effect of basing a VL transformer on a vision model versus a text model. Additionally, we introduce a VL modeling platform called Renaissance that we use to conduct all of the experiments. This platform offers a great deal of flexibility in creating, training, and evaluating transformer encoders for VL modeling. The source code for Renaissance can be found at https://github.com/bsu-slim/renaissance.
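The freezing strategy described above can be illustrated with a minimal PyTorch sketch. The module names (`vision_tower`, `text_tower`, `fusion_head`) and the toy architecture below are illustrative assumptions, not the actual layout of Renaissance or of any model studied in the paper; the point is only the mechanism of disabling gradients for a pretrained sub-network so that backpropagation skips it during VL pretraining.

```python
import torch
from torch import nn

class ToyVLEncoder(nn.Module):
    """Toy stand-in for a vision-language encoder: two unimodal
    'towers' plus a small fusion head. Purely illustrative."""
    def __init__(self):
        super().__init__()
        self.vision_tower = nn.Sequential(
            nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 8))
        self.text_tower = nn.Sequential(
            nn.Linear(16, 16), nn.ReLU(), nn.Linear(16, 8))
        self.fusion_head = nn.Linear(16, 4)

def freeze(module: nn.Module) -> None:
    """Disable gradient updates for every parameter in `module`."""
    for p in module.parameters():
        p.requires_grad = False

model = ToyVLEncoder()
freeze(model.vision_tower)  # the pretrained vision tower stays fixed

# Only parameters with requires_grad=True are updated, so the
# optimizer (and the backward pass through the frozen tower's
# weights) does less work:
optimizer = torch.optim.AdamW(
    (p for p in model.parameters() if p.requires_grad), lr=1e-4)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
```

In this toy model, `trainable` covers only the text tower and fusion head; in a realistic setting the frozen tower would be initialized from a pretrained unimodal checkpoint rather than trained from scratch.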