We propose a new paradigm for long-video understanding that treats a long video as a Neural Knowledge Representation (NKR). NKR represents video content neither as a stream of tokens nor as a pre-organized database, but as a small, video-specific set of network weights attached to a Vision-Language Model (VLM) backbone. The NKR weights are optimized to encapsulate the video's semantic content through a novel Agentic Knowledge Distillation (AKD) process, in which an agent automatically synthesizes dense descriptions and question-answer pairs to distill the video's knowledge into the NKR. AKD is a comprehensive, one-time encoding phase; the resulting NKR turns the video into a portable, reusable asset. At inference, the lightweight NKR is mounted onto the frozen VLM, enabling direct, query-based understanding without reloading or re-encoding the original video. This approach decouples video length from inference cost, offering high amortized efficiency for multi-turn video understanding. Experiments on the LVBench benchmark show that our method achieves performance comparable to state-of-the-art approaches while reducing end-to-end latency by over two orders of magnitude, opening new possibilities for interactive long-video understanding.
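The amortization argument in the abstract (a one-time encoding phase followed by cheap, repeated queries) can be illustrated with a small arithmetic sketch. This is only an illustration: the function names and every number below are hypothetical and are not taken from the paper, and the sketch does not model how AKD or NKR work internally.

```python
import math
from typing import Optional


def amortized_cost(one_time: float, per_query: float, num_queries: int) -> float:
    """Average cost per query when a one-time cost is spread over num_queries."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return one_time / num_queries + per_query


def break_even_queries(
    one_time: float, per_query_encoded: float, per_query_baseline: float
) -> Optional[int]:
    """Smallest query count at which encode-once costs no more than the baseline.

    Returns None if the encoded approach never catches up, i.e. it does not
    save anything per query.
    """
    saving = per_query_baseline - per_query_encoded
    if saving <= 0:
        return None
    return math.ceil(one_time / saving)


# Hypothetical costs in seconds (assumptions for illustration only):
ONE_TIME_ENCODING = 600.0   # one-time encoding of the video
QUERY_ENCODED = 1.0         # per-query cost once the video is encoded
QUERY_BASELINE = 61.0       # per-query cost when the video is re-processed each time

n_star = break_even_queries(ONE_TIME_ENCODING, QUERY_ENCODED, QUERY_BASELINE)
avg_at_break_even = amortized_cost(ONE_TIME_ENCODING, QUERY_ENCODED, n_star)
avg_at_100 = amortized_cost(ONE_TIME_ENCODING, QUERY_ENCODED, 100)
```

Under these assumed numbers, the one-time cost is recovered after 10 queries, and the average cost per query keeps falling toward the per-query cost as the number of turns grows. The baseline's per-query cost would typically grow with video length, whereas the encoded per-query cost would not, which is the sense in which video length is decoupled from inference cost.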