Recent work has begun to investigate how well large language models (LLMs) handle video, and several video LLMs have been proposed. However, their ability to perform video grounding (VG), an important time-related video task that requires a model to precisely locate the start and end timestamps of the temporal moments in a video that match a given textual query, remains unexplored in the literature. To fill this gap, we propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding. Based on LLM4VG, we design extensive experiments to examine two groups of models on video grounding: (i) video LLMs trained on text-video pairs (denoted as VidLLM), and (ii) LLMs combined with pretrained visual description models, such as video or image captioning models. We propose prompt methods that integrate the VG instruction with descriptions from different kinds of generators, including caption-based generators for direct visual description and VQA-based generators for information enhancement. We also provide comprehensive comparisons of various VidLLMs and explore the influence of different choices of visual models, LLMs, and prompt designs, among other factors. Our experimental evaluations lead to two conclusions: (i) existing VidLLMs are still far from achieving satisfactory video grounding performance, and more time-related video tasks should be included to further fine-tune these models; and (ii) the combination of LLMs and visual models shows preliminary abilities for video grounding, with considerable potential for improvement through more reliable models and further guidance from prompt instructions.
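To make the second setting concrete, the following is a minimal sketch of how a VG instruction might be combined with timestamped captions into a single prompt, and how a start and end time might be read back from the LLM's reply. The function names `build_vg_prompt` and `parse_span`, the prompt wording, and the "start - end" answer format are all illustrative assumptions and are not taken from LLM4VG.

```python
import re
from typing import List, Optional, Tuple


def build_vg_prompt(captions: List[Tuple[float, str]], query: str) -> str:
    """Combine a grounding instruction, timestamped captions, and the query.

    `captions` holds (timestamp_in_seconds, description) pairs, e.g. the
    output of a captioning model sampled along the video (assumed format).
    """
    lines = [f"[{t:.1f}s] {text}" for t, text in captions]
    return (
        "Find the start and end time (in seconds) of the moment "
        "described by the query, using the captions below.\n"
        + "\n".join(lines)
        + f"\nQuery: {query}\n"
        "Answer in the form: start - end"
    )


def parse_span(answer: str) -> Optional[Tuple[float, float]]:
    """Read the first two numbers in the reply as (start, end) in seconds.

    Returns None when fewer than two numbers are present.
    """
    nums = re.findall(r"\d+(?:\.\d+)?", answer)
    if len(nums) < 2:
        return None
    start, end = float(nums[0]), float(nums[1])
    # Swap if the model returned the boundaries in reverse order.
    return (start, end) if start <= end else (end, start)
```

The prompt string would be sent to whichever LLM is under evaluation, and the parsed span compared against the annotated moment. Descriptions from a VQA-based generator could be appended to the caption lines in the same way.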