Recently there has been a surge of multi-modal architectures built on Large Language Models (LLMs). These architectures leverage the zero-shot generation capabilities of LLMs: they project image embeddings into the text space and then use the LLM's auto-regressive capacity to solve tasks such as VQA, captioning, and image retrieval. We call them "bridge architectures" because they project from the image space to the text space. These models deviate from the traditional recipe for training transformer-based multi-modal models, which involves large-scale pre-training and complex multi-modal interactions through co-attention or cross-attention. However, the capabilities of bridge architectures have not been tested on complex visual reasoning tasks that require fine-grained analysis of the image. In this project, we investigate the performance of bridge architectures on the NLVR2 dataset and compare it to state-of-the-art transformer-based architectures. We first extend the traditional bridge architectures for the NLVR2 dataset by adding object-level features to facilitate fine-grained object reasoning. Our analysis shows that adding object-level features to bridge architectures does not help, and that pre-training on multi-modal data is key for good performance on complex reasoning tasks such as NLVR2. We also present initial results for a recently proposed bridge architecture, LLaVA, in the zero-shot setting and analyze its performance.
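To make the notion of a bridge architecture concrete, the following is a minimal sketch of the projection step described above, not the implementation used in this work. It assumes the bridge is a single linear map from image-embedding space to the LLM's token-embedding space, and that the projected image vectors are simply prepended to the text token embeddings before auto-regressive decoding. All names (`make_projection`, `project`, `build_llm_input`) and the toy dimensions are hypothetical, and the weights are random here, whereas in practice they would be learned.

```python
import random


def make_projection(image_dim, text_dim, seed=0):
    """Randomly initialised linear map W (text_dim x image_dim).

    Assumption: a single linear layer acts as the bridge; real systems
    may use a learned MLP or a more elaborate module instead.
    """
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(image_dim)]
            for _ in range(text_dim)]


def project(image_embedding, weights):
    """Map one image embedding into the text embedding space (W @ x)."""
    return [sum(w * x for w, x in zip(row, image_embedding))
            for row in weights]


def build_llm_input(image_embeddings, text_token_embeddings, weights):
    """Prefix the projected image vectors to the text token embeddings."""
    visual_prefix = [project(e, weights) for e in image_embeddings]
    return visual_prefix + text_token_embeddings


# Toy sizes (hypothetical): two image vectors of dimension 4, since an
# NLVR2 example pairs two images with one sentence, and three text
# tokens of dimension 6.
image_dim, text_dim = 4, 6
weights = make_projection(image_dim, text_dim)
image_embeddings = [[1.0, 0.0, 0.5, -0.5], [0.2, 0.3, 0.0, 1.0]]
text_token_embeddings = [[0.0] * text_dim for _ in range(3)]

sequence = build_llm_input(image_embeddings, text_token_embeddings, weights)
print(len(sequence), len(sequence[0]))  # 5 vectors, each of dimension 6
```

In this sketch the LLM itself is untouched: only the projection carries visual information into the text space, which is the design choice that distinguishes bridge architectures from models relying on co-attention or cross-attention between modalities.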