Answering navigation-relevant questions over long egocentric videos requires retrieving and organizing evidence distributed across distant temporal moments while maintaining spatial and contextual consistency. Although long-context vision-language models can achieve strong answer quality, they are computationally expensive on long trajectories and inefficient under repeated querying. Recent graph-based approaches such as VL-KnG address this challenge with persistent spatio-temporal knowledge graphs, but graph-centric retrieval alone may underrepresent broader temporal continuity and contextual cues. We present VL-MemKnG, a hybrid memory framework that extends VL-KnG by combining a spatio-temporal knowledge graph with persistent segment-level contextual memory. The knowledge graph captures structured relational information and long-range object associations, while the segment-level memory preserves broader temporal context for long-horizon evidence retrieval. A hybrid retrieval-and-reasoning module operates jointly over both memory representations to produce evidence-grounded answers together with temporally organized supporting evidence. We also introduce WalkieKnowledgeT+, an extension of WalkieKnowledge for long-horizon, navigation-oriented video question answering. The benchmark includes temporally distributed reasoning tasks that require aggregating evidence across multiple non-co-occurring moments. On WalkieKnowledgeT+, VL-MemKnG improves Top-1 retrieval accuracy from 58% to 67% and Recall@1 from 34.50% to 40.55%, outperforming all compared methods, including Gemini 2.5 Pro and Qwen 3.5+. The gains are particularly pronounced on temporal-global and temporally scattered aggregation questions, indicating the benefit of combining structured relational memory with segment-level contextual memory while maintaining efficient query-time inference.
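As a rough illustration of the hybrid retrieval idea described above, the following minimal sketch fuses scored hits from a graph memory and a segment memory, then returns the selected evidence in temporal order. It is not the VL-MemKnG implementation: the `Evidence` type, the `hybrid_retrieve` function, the linear score weighting, and the `graph_weight` parameter are all hypothetical choices made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Evidence:
    """One retrieved item (hypothetical structure, not from the paper)."""
    source: str     # "graph" or "segment"
    start_s: float  # start time in the video, in seconds
    end_s: float    # end time in the video, in seconds
    score: float    # relevance score assigned by the retriever
    text: str       # short description of the evidence


def hybrid_retrieve(graph_hits, segment_hits, k=3, graph_weight=0.5):
    """Fuse hits from both memories and return the top-k in temporal order.

    Assumption: scores from the two memories are comparable after a simple
    linear weighting. A real system may need calibration or a learned fusion.
    """
    weighted = [(graph_weight * e.score, e) for e in graph_hits]
    weighted += [((1.0 - graph_weight) * e.score, e) for e in segment_hits]
    # Keep the k highest fused scores across both memories.
    top = sorted(weighted, key=lambda pair: pair[0], reverse=True)[:k]
    # Organize the surviving evidence by when it occurs in the video.
    return sorted((e for _, e in top), key=lambda e: e.start_s)
```

Sorting the final selection by start time is one simple way to obtain "temporally organized supporting evidence"; the selection itself is driven only by the fused score.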