Vision-language models (VLMs) demonstrate strong image-level scene understanding but often lack persistent memory, explicit spatial representations, and computational efficiency when reasoning over long video sequences. We present VL-KnG, a training-free framework that constructs spatiotemporal knowledge graphs from monocular video, bridging fine-grained scene graphs and global topological graphs without 3D reconstruction. VL-KnG processes video in chunks, maintains persistent object identity via LLM-based Spatiotemporal Object Association (STOA), and answers queries via Graph-Enhanced Retrieval (GER), a hybrid of GraphRAG subgraph retrieval and SigLIP2 visual grounding. Once built, the knowledge graph removes the need to re-process video at query time, so query-time inference stays constant regardless of video length. Evaluation on three benchmarks (OpenEQA, NaVQA, and our newly introduced WalkieKnowledge) shows that VL-KnG matches or surpasses frontier VLMs on embodied scene understanding tasks at significantly lower query latency, with explainable, graph-grounded reasoning. A real-world robot deployment confirms practical applicability and the constant-time scaling of query inference.
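As a rough illustration of the build-once, query-many idea described above, the following is a minimal sketch in Python and not the authors' implementation. The `SceneGraph` class, its method names, and the toy chunk data are all hypothetical. The `associate` step stands in for STOA with plain label matching (the framework uses an LLM for this), and `subgraph` stands in for the graph side of GER with a one-hop edge lookup; SigLIP2 visual grounding is omitted entirely.

```python
from collections import defaultdict


class SceneGraph:
    """Toy spatiotemporal graph: nodes are objects, edges are labeled relations."""

    def __init__(self):
        self.seen_in = defaultdict(list)  # object id -> chunk indices where it appeared
        self.edges = set()                # (subject id, predicate, object id)

    def associate(self, label):
        # Placeholder for STOA (assumption): identical labels are treated as the
        # same object. The paper describes an LLM-based association instead.
        return label

    def add_chunk(self, chunk_idx, detections, relations):
        """Merge one video chunk's detections and relations into the graph."""
        for label in detections:
            obj = self.associate(label)
            if chunk_idx not in self.seen_in[obj]:
                self.seen_in[obj].append(chunk_idx)
        for subj, pred, obj in relations:
            self.edges.add((self.associate(subj), pred, self.associate(obj)))

    def subgraph(self, obj):
        """Return all edges touching obj, sorted. No video frames are read here."""
        return sorted(e for e in self.edges if obj in (e[0], e[2]))


def build_graph(chunks):
    """Process chunks in order; each chunk is (detections, relations)."""
    graph = SceneGraph()
    for idx, (detections, relations) in enumerate(chunks):
        graph.add_chunk(idx, detections, relations)
    return graph


# Hypothetical per-chunk outputs of an upstream perception step.
chunks = [
    (["mug", "table"], [("mug", "on", "table")]),
    (["table", "chair"], [("chair", "next_to", "table")]),
    (["mug", "sink"], [("mug", "in", "sink")]),
]
graph = build_graph(chunks)
print(graph.seen_in["mug"])   # [0, 2]
print(graph.subgraph("mug"))  # [('mug', 'in', 'sink'), ('mug', 'on', 'table')]
```

In this sketch a query reads only the stored graph, so its cost depends on the graph rather than on the number of frames; the linear scan over edges is a simplification, and an indexed store would be the natural choice in practice.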