Sparse camera-conditioned image-to-video generation presents a pivotal challenge: synthesizing geometrically consistent 3D motion from minimal pose cues. Existing methods, which largely rely on dense supervision or naive interpolation, suffer from severe pose drift and motion discontinuities because they lack robust 3D priors. In this paper, we introduce CamGeo, a novel framework that distills rich 3D geometric knowledge from a pre-trained video-to-3D model (VGGT) directly into the diffusion backbone. To achieve this without incurring inference latency, we propose a training-only distillation strategy. Specifically, CamGeo incorporates: (1) keyframe trajectory distillation, which enforces cycle consistency with the sparse input poses; (2) cross-frame consistency distillation, which applies both camera trajectory and depth constraints to generate consistent structure across unsupervised frames; and (3) a three-stage coarse-to-fine curriculum that progressively scales geometric complexity, from global structure coherence to fine-grained refinement, to stabilize optimization. Extensive experiments demonstrate that CamGeo achieves consistent improvements under various sparsity ratios.
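As an illustration of how the three distillation components and the three-stage curriculum might be combined during training, the following is a minimal sketch under stated assumptions. The names `LossWeights`, `curriculum_weights`, and `total_loss` are hypothetical, as are the equal-length stages and the choice of which terms are active in each stage; the abstract does not specify these details, so this should not be read as the actual CamGeo schedule.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LossWeights:
    """Weights for the three distillation terms (hypothetical names)."""
    keyframe_traj: float      # (1) keyframe trajectory distillation
    cross_frame_traj: float   # (2) cross-frame camera trajectory constraint
    cross_frame_depth: float  # (2) cross-frame depth constraint


# ASSUMPTION: which terms are switched on in each stage is illustrative only.
# Stage 1 targets global structure, stage 3 adds the finest-grained constraint.
STAGES = (
    LossWeights(1.0, 0.0, 0.0),
    LossWeights(1.0, 1.0, 0.0),
    LossWeights(1.0, 1.0, 1.0),
)


def curriculum_weights(step: int, total_steps: int) -> LossWeights:
    """Return the loss weights for a training step.

    ASSUMPTION: the three stages have equal length; integer arithmetic keeps
    the stage boundaries exact.
    """
    if total_steps <= 0 or step < 0:
        raise ValueError("step must be >= 0 and total_steps must be > 0")
    stage = min(len(STAGES) - 1, step * len(STAGES) // total_steps)
    return STAGES[stage]


def total_loss(diffusion: float, keyframe_traj: float, cross_frame_traj: float,
               cross_frame_depth: float, w: LossWeights) -> float:
    """Diffusion loss plus weighted distillation terms (training only)."""
    return (diffusion
            + w.keyframe_traj * keyframe_traj
            + w.cross_frame_traj * cross_frame_traj
            + w.cross_frame_depth * cross_frame_depth)
```

Because the distillation terms enter only through the training loss, a schedule of this kind leaves the inference path of the diffusion backbone unchanged, which is consistent with the training-only strategy described above.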