Large-scale video diffusion models often fail to preserve 3D structure over time, causing geometric drift and implausible motion under viewpoint changes. Existing methods usually enforce geometric consistency through explicit geometry reconstructions, such as depth maps, point clouds, or reconstructed 3D structures, which they use to define conditions, supervision, or reward signals; this makes the generator sensitive to errors from upstream geometry pipelines. We propose VideoWeave, a latent-space post-training framework that instead uses implicit geometry-model features to constrain the generative distribution, providing a more flexible, non-rigid form of guidance that mitigates the impact of reconstruction errors from geometry models. Specifically, VideoWeave adapts these features into geometry latents and jointly models them with video latents in a shared denoising space, allowing geometry to shape the generative distribution during training. To support this process, we build GeoVid-80K, an 80K-video dataset with paired appearance and geometry representations. Experiments on text-to-video and image-to-video generation show that VideoWeave improves geometric coherence while preserving strong visual quality. The VideoWeave project page is at https://videoweave.github.io/