LLM4VG: Large Language Models Evaluation for Video Grounding

Recently, researchers have begun to investigate how well LLMs handle videos and have proposed several video LLMs. However, the ability of LLMs to perform video grounding (VG), an important time-related video task that requires the model to precisely locate the start and end timestamps of the temporal moments in a video that match a given textual query, remains unclear and unexplored in the literature. To fill this gap, we propose the LLM4VG benchmark, which systematically evaluates the performance of different LLMs on video grounding tasks. Based on LLM4VG, we design extensive experiments to examine two groups of video LLMs on video grounding: (i) video LLMs trained on text-video pairs (denoted as VidLLM), and (ii) LLMs combined with pretrained visual description models, such as video or image captioning models. We propose prompt methods that integrate the VG instruction with descriptions from different kinds of generators, including caption-based generators for direct visual description and VQA-based generators for information enhancement. We also provide comprehensive comparisons of various VidLLMs and explore the influence of different choices of visual models, LLMs, and prompt designs, among others. Our experimental evaluations lead to two conclusions: (i) existing VidLLMs are still far from achieving satisfactory video grounding performance, and more time-related video tasks should be included to further fine-tune these models; and (ii) the combination of LLMs and visual models shows preliminary abilities for video grounding, with considerable potential for improvement through more reliable models and further guidance from prompt instructions.
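To make the second setting concrete, the following is a minimal sketch of how timestamped descriptions from a captioning model might be combined with a VG instruction, and how an LLM's reply might be turned back into a time span. It is an illustration under stated assumptions, not the benchmark's actual prompts or code: the function names, the prompt wording, and the "start - end" answer format are all hypothetical. Temporal IoU is included because it is a commonly used overlap measure for grounding; the abstract itself does not state which metrics LLM4VG reports.

```python
import re
from typing import List, Optional, Tuple

Span = Tuple[float, float]


def build_vg_prompt(captions: List[Tuple[float, str]], query: str) -> str:
    """Join timestamped descriptions and a grounding instruction into one prompt.

    The wording is a hypothetical example, not the prompt used in the paper.
    """
    lines = [f"{t:g}s: {text}" for t, text in captions]
    return (
        "Below are descriptions of a video at different timestamps.\n"
        + "\n".join(lines)
        + f"\nFind the moment that matches the query: '{query}'.\n"
        + "Answer with the start and end time in seconds, e.g. '12 - 20'."
    )


def parse_span(reply: str) -> Optional[Span]:
    """Take the first two numbers in the reply as (start, end).

    Returns None when fewer than two numbers are present.
    """
    nums = re.findall(r"\d+(?:\.\d+)?", reply)
    if len(nums) < 2:
        return None
    start, end = float(nums[0]), float(nums[1])
    # Swap if the model returned the two times in reverse order.
    return (start, end) if start <= end else (end, start)


def temporal_iou(pred: Span, gt: Span) -> float:
    """Intersection over union of two time spans, in [0, 1]."""
    inter = max(0.0, min(pred[1], gt[1]) - max(pred[0], gt[0]))
    union = (pred[1] - pred[0]) + (gt[1] - gt[0]) - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    # Made-up captions for illustration only.
    caps = [(0, "a man stands in a kitchen"), (3, "a man opens a door")]
    print(build_vg_prompt(caps, "the person opens the door"))
    print(parse_span("The moment is from 5 to 12.5 seconds."))
    print(temporal_iou((5.0, 10.0), (0.0, 10.0)))
```

Taking the first two numbers in the reply is a deliberately simple parsing choice; free-form LLM answers may need stricter output formats or more defensive parsing in practice.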

