Spatial reasoning from egocentric videos is inherently challenging because the observable evidence is constrained by the camera trajectory. Existing methods rely on single-turn inference, forcing models to resolve geometric ambiguity through semantic priors rather than verifiable evidence. We argue that spatial reasoning should be revisitable: conclusions formed under limited evidence should remain open to revision when complementary viewpoints become available. Building on this insight, we propose Reason, then Re-reason (ReRe), a training-free, inference-time framework with two phases: in the Reason Phase, an MLLM forms a spatial hypothesis from the original video; in the Re-reason Phase, it verifies or revises the hypothesis by observing a synthesized novel-view video. To enable effective cross-view revisiting, we design a Geometry-to-Video pipeline that renders strategically complementary novel views from predicted 3D geometry. These views feature an elevated, oblique perspective with scene-spanning coverage, while preserving the MLLM's native video interface without architectural modifications. Extensive evaluations on VSI-Bench and STI-Bench demonstrate that ReRe substantially boosts open-source MLLMs to rival proprietary state-of-the-art performance. Project page: https://zhenjiemao.github.io/ReRe/
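The two-phase procedure described above can be sketched as a short control loop. This is a minimal illustration under stated assumptions, not the authors' implementation: `ask_mllm` and `render_novel_view` are hypothetical callables standing in for the MLLM query and the Geometry-to-Video pipeline, and the follow-up prompt wording, as well as the choice to show only the novel-view video in the second phase, are assumptions.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

Video = Sequence[object]  # placeholder type: any sequence of frames


@dataclass
class ReReResult:
    hypothesis: str  # answer formed in the Reason Phase
    final: str       # answer after the Re-reason Phase
    revised: bool    # True if the second phase changed the answer


def rere(
    question: str,
    video: Video,
    ask_mllm: Callable[[Video, str], str],         # hypothetical MLLM wrapper
    render_novel_view: Callable[[Video], Video],   # hypothetical renderer
) -> ReReResult:
    # Reason Phase: form a spatial hypothesis from the original video.
    hypothesis = ask_mllm(video, question)

    # Synthesize a complementary viewpoint (assumed to be produced by a
    # geometry-based renderer; its internals are out of scope here).
    novel_video = render_novel_view(video)

    # Re-reason Phase: verify or revise the hypothesis on the novel view.
    # The prompt text below is an assumption made for illustration.
    followup = (
        f"{question}\n"
        f"Initial answer from the original video: {hypothesis}\n"
        "Using this additional viewpoint, confirm or revise the answer."
    )
    final = ask_mllm(novel_video, followup)

    return ReReResult(hypothesis, final, final.strip() != hypothesis.strip())
```

Because both phases go through the same `ask_mllm` callable with a video input, the sketch reflects the claim that the MLLM's native video interface is used without architectural changes.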