Extending image-based Large Multimodal Models (LMMs) to videos is challenging due to the inherent complexity of video data. Recent approaches that extend image-based LMMs to videos either lack grounding capabilities (e.g., VideoChat, Video-ChatGPT, Video-LLaMA) or do not use audio signals for better video understanding (e.g., Video-ChatGPT). To address these gaps, we propose PG-Video-LLaVA, the first LMM with pixel-level grounding capability, which integrates audio cues by transcribing them into text to enrich video-context understanding. Our framework uses an off-the-shelf tracker and a novel grounding module, enabling it to spatially localize objects in videos following user instructions. We evaluate PG-Video-LLaVA on video-based generative and question-answering benchmarks and introduce new benchmarks specifically designed to measure prompt-based object grounding performance in videos. Further, for video-based conversation benchmarking we propose using Vicuna instead of GPT-3.5 (as used in Video-ChatGPT), which ensures reproducible results, a concern given the proprietary nature of GPT-3.5. Our framework builds on the SoTA image-based LLaVA model and extends its advantages to the video domain, delivering promising gains on video-based conversation and grounding tasks. Project Page: https://github.com/mbzuai-oryx/Video-LLaVA