As online video platforms grow and the volume of video content escalates, the demand for capable video understanding tools has intensified markedly. Given the remarkable capabilities that Large Language Models (LLMs) have shown on key language tasks, this survey provides a detailed overview of recent advances in video understanding that harness LLMs (Vid-LLMs). The emergent capabilities of Vid-LLMs are surprisingly advanced, particularly their capacity for open-ended spatial-temporal reasoning combined with commonsense knowledge, which suggests a promising path for future video understanding. We examine the unique characteristics and capabilities of Vid-LLMs and categorize the approaches into four main types: LLM-based Video Agents, Vid-LLMs Pretraining, Vid-LLMs Instruction Tuning, and Hybrid Methods. The survey also presents a comprehensive study of the tasks and datasets for Vid-LLMs, along with the methodologies used to evaluate them. It further explores the broad applications of Vid-LLMs across domains, showcasing their scalability and versatility in addressing real-world video understanding challenges. Finally, the survey summarizes the limitations of existing Vid-LLMs and outlines directions for future research. For more information, we recommend that readers visit the repository at https://github.com/yunlong10/Awesome-LLMs-for-Video-Understanding.