Accurate interactive camera control is essential for video-based world models, but most existing approaches learn camera motion implicitly, leading to inaccurate control under out-of-distribution trajectories. Explicit geometric conditioning improves controllability, but existing methods are non-autoregressive and rely on a static 3D cache built from an initial frame, which becomes ineffective once the viewpoint moves beyond the original frustum. We propose GeoStream, a framework that enables precise metric-scale camera control in autoregressive streaming video generation. Our method maintains a self-refreshing 3D cache that is periodically updated online from the model's own outputs: we estimate depth from the most recently generated frame, unproject to 3D, and reproject into the target view to produce point reprojections as geometric conditioning for subsequent synthesis. By the same principle, the conditioning seen during training is also rendered from the student's own generated frames, yielding a fully on-policy distillation that naturally aligns the train and inference conditioning distributions. Unlike prior work that uses off-policy condition noising, our approach trains the model against the exact error distribution it encounters at inference, mitigating both standard autoregressive drift and the second-order geometric feedback loop that arises when the cache itself is derived from generated outputs. Quantitative and qualitative results show that our approach substantially improves camera controllability.
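The cache refresh described in the abstract (estimate depth, unproject to 3D, reproject into the target view) can be illustrated with a minimal NumPy sketch. This is not GeoStream's implementation. It assumes a pinhole camera with intrinsics `K`, a metric depth map supplied by some external estimator, and a 4x4 rigid transform `T_src_to_tgt` from the source camera frame to the target camera frame. The function names `unproject` and `reproject` are hypothetical, and the nearest-point splat stands in for whatever renderer the authors actually use.

```python
import numpy as np

def unproject(depth, K):
    """Lift an (H, W) metric depth map to (H*W, 3) points in the source camera frame."""
    h, w = depth.shape
    u, v = np.meshgrid(np.arange(w), np.arange(h))       # pixel grid, row-major
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)
    pix = pix.reshape(-1, 3).astype(np.float64)
    rays = pix @ np.linalg.inv(K).T                       # rays with z = 1
    return rays * depth.reshape(-1, 1)                    # scale each ray by its depth

def reproject(points_src, colors, K, T_src_to_tgt, hw):
    """Splat colored source-frame points into the target view (nearest point wins).

    Returns the rendered (H, W, 3) image and a boolean (H, W) mask of covered pixels.
    Uncovered pixels are holes that the generator would have to synthesize.
    """
    h, w = hw
    R, t = T_src_to_tgt[:3, :3], T_src_to_tgt[:3, 3]
    p = points_src @ R.T + t                              # move points to target frame
    front = p[:, 2] > 1e-6                                # drop points behind the camera
    p, colors = p[front], colors[front]
    z = p[:, 2]
    uv = p @ K.T                                          # perspective projection
    u = np.round(uv[:, 0] / z).astype(int)
    v = np.round(uv[:, 1] / z).astype(int)
    inside = (u >= 0) & (u < w) & (v >= 0) & (v < h)      # keep points inside the frustum
    u, v, z, colors = u[inside], v[inside], z[inside], colors[inside]

    order = np.argsort(z, kind="stable")                  # nearest points first
    lin = (v * w + u)[order]                              # flat pixel index per point
    _, first = np.unique(lin, return_index=True)          # first hit per pixel = nearest
    keep = order[first]

    image = np.zeros((h, w, 3), dtype=colors.dtype)
    mask = np.zeros((h, w), dtype=bool)
    image[v[keep], u[keep]] = colors[keep]
    mask[v[keep], u[keep]] = True
    return image, mask
```

In a streaming loop, one would call `unproject` on the depth estimated from the most recently generated frame and `reproject` with the next target pose, then pass the rendered image and mask to the generator as conditioning. Under the on-policy scheme the abstract describes, the same rendering would be applied to the student's own generated frames during training, so the holes and depth errors seen in training match those seen at inference.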