Video spatial reasoning requires accumulating viewpoint-dependent evidence over time while retaining the information relevant to the question being asked. Existing spatial video-language models improve geometric perception and long-range context modeling, but often treat memory as a generic temporal cache, which can introduce redundant or irrelevant geometry and weaken long-horizon reasoning. We propose \textbf{\ours}, a question-guided geometric memory framework for video spatial reasoning. \ours injects camera-conditioned geometry into visual tokens and maintains two complementary memories: a Fine-Grained Context Bank for recent dense features and camera states, and a Semantic-Geometric Evidence Bank for compact long-range evidence. Each candidate frame is scored by the product of its Q-Former-based question relevance and its novelty with respect to the retained bank; this score is stored with the entry and reused when the bank is read, while a capacity-based replacement rule keeps the bank compact. During reasoning, both memories are read before they are updated, and the retrieved content is adaptively fused with the current frame representation. Experiments on VSI-Bench and VSTI-Bench show that \ours achieves state-of-the-art performance among the evaluated spatial reasoning models, supporting the effectiveness of question-guided geometric memory. Ablations further verify the contribution of the proposed evidence scoring mechanism.
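
A minimal sketch of the evidence score described above, under assumed notation that does not appear in the source: $r_t \in [0,1]$ is the Q-Former-based relevance of candidate frame $t$ to the question, $f_t$ is its feature vector, and $\mathcal{B}$ is the retained evidence bank. The abstract states only that the score is the product of relevance and novelty; the cosine-based novelty term and the evict-the-minimum rule below are illustrative assumptions, not the authors' definitions.
\begin{equation}
s_t = r_t \cdot n_t, \qquad n_t = 1 - \max_{b \in \mathcal{B}} \cos(f_t, f_b),
\end{equation}
with $n_t = 1$ when $\mathcal{B}$ is empty. Under these assumptions, once the bank holds $K$ entries (a hypothetical capacity), a candidate would replace the lowest-scoring entry $b^\star = \arg\min_{b \in \mathcal{B}} s_b$ only if $s_t > s_{b^\star}$. A frame that is irrelevant to the question ($r_t \approx 0$) or nearly identical to stored evidence ($n_t \approx 0$) then receives a score near zero and is not retained.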