With the rapid growth of online video platforms and the escalating volume of video content, the demand for proficient video understanding tools has intensified markedly. Given the remarkable capabilities of Large Language Models (LLMs) in language and multimodal tasks, this survey provides a detailed overview of recent advances in video understanding with LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their capacity for open-ended spatial-temporal reasoning combined with commonsense knowledge, suggesting a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs, categorizing the approaches into four main types: LLM-based Video Agents, Vid-LLMs Pretraining, Vid-LLMs Instruction Tuning, and Hybrid Methods. Furthermore, this survey presents a comprehensive study of the tasks, datasets, and evaluation methodologies for Vid-LLMs. Additionally, it explores the broad applications of Vid-LLMs across various domains, highlighting their scalability and versatility in real-world video understanding challenges. Finally, it summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, readers are encouraged to visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.