Large Vision-Language Models (LVLMs) have improved performance on a variety of downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, because images and videos lack a unified tokenization, that is, they are misaligned before projection, it is difficult for a Large Language Model (LLM) to learn multi-modal interactions from several inadequate projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, with the two modalities mutually enhancing each other. Video-LLaVA achieves superior performance on 9 image benchmarks, spanning 5 image question-answering datasets and 4 image benchmark toolkits. Video-LLaVA also outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that images and videos benefit each other within Video-LLaVA's unified visual representation, and that the model outperforms models designed specifically for images or for videos. We hope this work provides modest insights into multi-modal inputs for LLMs.