Large Vision-Language Models (LVLMs) have enhanced the performance of various downstream tasks in visual-language understanding. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to large language models. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several suboptimal projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, mutually enhancing each other. Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks, spanning 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, Video-LLaVA outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into multi-modal inputs for the LLM. Code is available at \url{https://github.com/PKU-YuanGroup/Video-LLaVA}.
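The core idea of "alignment before projection" can be illustrated with a minimal numerical sketch. This is not the authors' implementation: all dimensions, weight names, and the use of plain linear maps are illustrative assumptions. It shows the contrast the abstract draws: image and video features are first mapped into one shared visual space, so a single projection layer feeds the LLM, rather than each modality passing through its own projection from a misaligned feature space.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions (not from the paper).
D_IMG, D_VID, D_SHARED, D_LLM = 768, 1024, 512, 4096

# Per-modality alignment maps into a shared visual space ("before projection").
W_img = rng.normal(size=(D_IMG, D_SHARED)) / np.sqrt(D_IMG)
W_vid = rng.normal(size=(D_VID, D_SHARED)) / np.sqrt(D_VID)

# One shared projection from the unified visual space into the LLM token space.
W_proj = rng.normal(size=(D_SHARED, D_LLM)) / np.sqrt(D_SHARED)

def to_llm_tokens(features: np.ndarray, W_align: np.ndarray) -> np.ndarray:
    """Align modality features into the shared space, then project to LLM inputs."""
    shared = features @ W_align   # (num_tokens, D_SHARED)
    return shared @ W_proj        # (num_tokens, D_LLM)

# Example inputs: 256 image patch tokens; 8 video frames of 256 patches each.
image_feats = rng.normal(size=(256, D_IMG))
video_feats = rng.normal(size=(8 * 256, D_VID))

img_tokens = to_llm_tokens(image_feats, W_img)
vid_tokens = to_llm_tokens(video_feats, W_vid)

# Both modalities now land in the same token space via one shared projection.
print(img_tokens.shape)  # (256, 4096)
print(vid_tokens.shape)  # (2048, 4096)
```

Because both modalities share `W_proj`, the LLM sees image and video tokens in a single aligned space, which is the property the abstract credits for the mutual benefit between image and video training data.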