Watch, Remember, Reason: Human-View Video Understanding with MLLMs

Jiahao Meng, Yue Tan, Qi Xu, Kuan Gao, Weisong Liu, Yanwei Li, Jason Li, Lingdong Kong, Haochen Wang, Qianyu Zhou, Jiangning Zhang, Guangliang Cheng, Yunhai Tong, Lu Qi, Minghsuan Yang

Video understanding is being rapidly transformed by multimodal large language models (MLLMs), as research moves from short clips to long, multimodal, and knowledge-intensive video scenarios. These scenarios require models to handle sparse evidence, long-range dependencies, multimodal alignment, and reliable inference under limited computational budgets. This work presents a human-view perspective on LLM-based video understanding, organized around three functional abilities: watching, remembering, and reasoning. Rather than treating video tasks as isolated benchmarks, this view provides a unified structure for analyzing how video MLLMs acquire evidence, preserve context, and produce grounded outputs. We introduce a formulation that characterizes video understanding systems by their perceptual representations, memory states, reasoning traces, and final predictions. Based on this formulation, we identify challenges in spatio-temporal perception, efficient long-video processing, memory modeling, streaming understanding, and faithful reasoning. Representative methods are organized by their roles in video MLLM systems. Watching covers fine-grained, comprehensive, audio-visual, and efficient perception. Remembering covers offline and streaming memory. Reasoning covers text-only reasoning and thinking with videos. We further examine application domains such as egocentric, sports, instructional, medical, and narrative videos, and survey training datasets and evaluation benchmarks across task types, supervision formats, modalities, and capability dimensions. Finally, we outline open problems and future directions for scalable, memory-aware, and evidence-grounded video intelligence. Related work will be tracked on an ongoing basis at https://github.com/marinero4972/Awesome-HumanView-VideoUnderstanding.
