Maintaining spatial world consistency over long horizons remains a central challenge for camera-controllable video generation. Existing memory-based approaches often condition generation on globally reconstructed 3D scenes by rendering anchor videos from geometry reconstructed over the generation history. However, reconstructing a global 3D scene from multiple views inevitably introduces cross-view misalignment, as pose and depth estimation errors cause the same surfaces to be reconstructed at slightly different 3D locations across views. When fused, these inconsistencies accumulate into noisy geometry that contaminates the conditioning signals and degrades generation quality. We introduce AnchorWeave, a memory-augmented video generation framework that replaces a single misaligned global memory with multiple clean local geometric memories and learns to reconcile their cross-view inconsistencies. To this end, AnchorWeave performs coverage-driven local memory retrieval aligned with the target trajectory and integrates the selected local memories through a multi-anchor weaving controller during generation. Extensive experiments demonstrate that AnchorWeave significantly improves long-term scene consistency while maintaining strong visual quality, with ablation and analysis studies further validating the effectiveness of local geometric conditioning, multi-anchor control, and coverage-driven retrieval.
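To make the retrieval step concrete, below is a minimal sketch of what coverage-driven local memory retrieval could look like. It assumes each local memory's visibility footprint is discretized into a set of scene cells and uses a greedy max-coverage heuristic over the cells visited by the target trajectory; the function name, the cell representation, and the greedy heuristic are illustrative assumptions, not the paper's exact procedure.

```python
def coverage_driven_retrieval(memory_footprints, target_cells, k=4):
    """Greedily select up to k local memories whose combined footprint
    best covers the scene cells seen along the target trajectory.

    memory_footprints: list of sets, each holding the scene cells one
        local geometric memory observes (hypothetical representation).
    target_cells: set of scene cells the target trajectory will visit.
    Returns the indices of the selected local memories.
    """
    selected, covered = [], set()
    candidates = set(range(len(memory_footprints)))
    while candidates and len(selected) < k:
        # Pick the candidate with the largest marginal coverage gain.
        best = max(
            candidates,
            key=lambda i: len((memory_footprints[i] & target_cells) - covered),
        )
        gain = len((memory_footprints[best] & target_cells) - covered)
        if gain == 0:  # remaining memories add no new target coverage
            break
        selected.append(best)
        covered |= memory_footprints[best] & target_cells
        candidates.remove(best)
    return selected
```

Under this assumed formulation, each selected memory would then be rendered into its own clean anchor video, and reconciling the residual cross-view inconsistencies between anchors is left to the learned multi-anchor weaving controller rather than to geometric fusion.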