Sparse camera-conditioned image-to-video generation poses a central challenge: synthesizing geometrically consistent 3D motion from minimal pose cues. Existing methods largely rely on dense supervision or naive interpolation and, lacking robust 3D priors, suffer from severe pose drift and motion discontinuities. In this paper, we introduce CamGeo, a framework that distills rich 3D geometric knowledge from a pre-trained video-to-3D model (VGGT) directly into the diffusion backbone. To avoid adding inference latency, we propose a training-only distillation strategy. Specifically, CamGeo incorporates: (1) keyframe trajectory distillation, which enforces cycle-consistency with the sparse input poses; (2) cross-frame consistency distillation, which applies both camera trajectory and depth constraints to generate consistent structure across unsupervised frames; and (3) a three-stage coarse-to-fine curriculum that progressively scales geometric complexity, from global structural coherence to fine-grained refinement, for stable optimization. Extensive experiments demonstrate that CamGeo achieves consistent improvements across a range of sparsity ratios.
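The way the three components could be combined under a staged curriculum can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the names `StageWeights`, `stage_for_step`, and `distillation_loss`, the stage boundaries at 30% and 70% of training, the per-stage weights, and the choice to enable the pose terms before the depth term. The three error inputs stand in for scalar losses that would be computed against the frozen teacher (VGGT) during training only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StageWeights:
    """Loss weights for one curriculum stage (hypothetical values)."""
    keyframe_pose: float      # keyframe trajectory distillation term
    cross_frame_pose: float   # cross-frame camera trajectory term
    depth: float              # cross-frame depth term


# Assumed coarse-to-fine schedule: start with keyframe poses only,
# then add cross-frame trajectory consistency, then depth constraints.
STAGES = (
    StageWeights(keyframe_pose=1.0, cross_frame_pose=0.0, depth=0.0),
    StageWeights(keyframe_pose=1.0, cross_frame_pose=0.5, depth=0.0),
    StageWeights(keyframe_pose=1.0, cross_frame_pose=1.0, depth=1.0),
)


def stage_for_step(step, total_steps, boundaries=(0.3, 0.7)):
    """Map a training step to a stage index 0, 1, or 2.

    `boundaries` are assumed fractions of total training at which the
    curriculum advances to the next stage.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    frac = step / total_steps
    if frac < boundaries[0]:
        return 0
    if frac < boundaries[1]:
        return 1
    return 2


def distillation_loss(step, total_steps, keyframe_pose_err,
                      cross_frame_pose_err, depth_err):
    """Weighted sum of the three distillation terms for the current stage.

    The error arguments are placeholders for scalar losses measured
    against the teacher's predictions; this term is used in training only.
    """
    w = STAGES[stage_for_step(step, total_steps)]
    return (w.keyframe_pose * keyframe_pose_err
            + w.cross_frame_pose * cross_frame_pose_err
            + w.depth * depth_err)
```

Because the distillation term appears only in the training objective, the teacher is not needed at inference time, which matches the training-only strategy described above.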