Recent advances in video generation enable a new paradigm for 3D scene creation: generating camera-controlled videos that simulate scene walkthroughs, then lifting them to 3D via feed-forward reconstruction techniques. This generative reconstruction approach combines the visual fidelity and creative capacity of video models with 3D outputs ready for real-time rendering and simulation. Scaling to large, complex environments requires 3D-consistent video generation over long camera trajectories with large viewpoint changes and location revisits, a setting in which current video models degrade quickly. Existing methods for long-horizon generation are fundamentally limited by two forms of degradation: spatial forgetting and temporal drifting. As exploration proceeds, previously observed regions fall outside the model's temporal context, so the model must hallucinate their structure when they are revisited. Meanwhile, autoregressive generation accumulates small synthesis errors over time, gradually distorting scene appearance and geometry. We present Lyra 2.0, a framework for generating persistent, explorable 3D worlds at scale. To address spatial forgetting, we maintain per-frame 3D geometry and use it solely for information routing -- retrieving relevant past frames and establishing dense correspondences with the target viewpoints -- while relying on the generative prior for appearance synthesis. To address temporal drifting, we train with self-augmented histories that expose the model to its own degraded outputs, teaching it to correct drift rather than propagate it. Together, these components enable substantially longer, 3D-consistent video trajectories, which we use to fine-tune feed-forward reconstruction models that reliably recover high-quality 3D scenes.
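The idea of using per-frame 3D geometry only for information routing can be illustrated with a minimal sketch. It is not the Lyra 2.0 implementation. It assumes a pinhole camera and scores each past frame by the fraction of its 3D points that project inside the target view, with no occlusion test. The names `visible_fraction` and `retrieve_frames`, the intrinsics `K`, and the `top_k` parameter are all hypothetical, chosen for illustration.

```python
import numpy as np


def visible_fraction(points_world, K, world_to_cam, width, height):
    """Fraction of a frame's 3D points that land inside the target image.

    Assumption: a simple frustum check (positive depth, inside image bounds);
    occlusion is ignored in this sketch.
    """
    n = points_world.shape[0]
    homog = np.hstack([points_world, np.ones((n, 1))])  # (N, 4) homogeneous points
    cam = (world_to_cam @ homog.T).T[:, :3]             # points in target camera frame
    z = cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)                 # avoid dividing by z <= 0
    pix = (K @ cam.T).T                                 # unnormalized pixel coordinates
    u = pix[:, 0] / safe_z
    v = pix[:, 1] / safe_z
    inside = in_front & (u >= 0) & (u < width) & (v >= 0) & (v < height)
    return float(inside.mean())


def retrieve_frames(frame_points, K, world_to_cam, width, height, top_k=2):
    """Rank past frames by visibility in the target view and keep the top_k."""
    scores = [
        visible_fraction(p, K, world_to_cam, width, height) for p in frame_points
    ]
    # Sort by descending score; ties are broken by the older frame index.
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return order[:top_k], scores


if __name__ == "__main__":
    K = np.array([[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]])
    target_pose = np.eye(4)  # target camera at the origin, looking down +z
    history = [
        np.array([[0.0, 0.0, 5.0], [0.1, 0.0, 5.0]]),    # in front of the camera
        np.array([[0.0, 0.0, -5.0], [0.1, 0.0, -5.0]]),  # behind the camera
    ]
    print(retrieve_frames(history, K, target_pose, 100, 100, top_k=1))
```

In this sketch the geometry decides only which past frames are worth conditioning on. Synthesizing appearance is left to the generative model, which matches the division of labor the abstract describes.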