Current multimodal models for Vision and Language (V+L) tasks predominantly repurpose Vision Encoders (VEs) as feature extractors. While many VEs -- of different architectures, trained on different data and objectives -- are publicly available, they are not designed for downstream V+L tasks. Nonetheless, most current work assumes that a \textit{single} pre-trained VE can serve as a general-purpose encoder. In this work, we focus on analysis and aim to understand whether the information stored within different VEs is complementary, i.e., whether providing the model with features from multiple VEs can improve performance on a target task, and how the model combines these features. We exhaustively experiment with three popular VEs on six downstream V+L tasks and analyze the attention and VE-dropout patterns. Our analyses suggest that diverse VEs complement each other, resulting in improved downstream V+L task performance, and that the improvements are not due to simple ensemble effects (i.e., performance does not always improve as the number of encoders increases). We demonstrate that future VEs, which are not \textit{repurposed} but explicitly \textit{designed} for V+L tasks, have the potential to improve performance on the target V+L tasks.