The abilities of large language models (LLMs) have recently progressed to unprecedented levels, paving the way to novel applications in a wide variety of areas. In computer vision, LLMs can be used to prime vision-language tasks such as image captioning and visual question answering when coupled with pre-trained vision backbones. While different approaches have been explored to interface LLMs with ``perceptual backbones'' that process, e.g., visual or audio data, they are often evaluated on different tasks and datasets, using different perceptual backbones and language models, which hinders direct comparison of the interfacing mechanisms. To remedy this lack of comparability between methods, we present an extensive experimental evaluation of different interfacing mechanisms across multiple tasks (including image, video, and audio captioning as well as visual question answering), datasets, and backbones, paying special attention to low-data settings. We find that existing mechanisms can improve on state-of-the-art results, and we identify a new interfacing mechanism that yields (near) optimal results across different tasks while reducing training time by a factor of four.