While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, physical plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
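The abstract names Direct Preference Optimization (DPO) as the alignment mechanism but does not spell out the objective. Below is a minimal numeric sketch of a Diffusion-DPO-style preference loss under assumed notation: `err_theta_*` and `err_ref_*` are placeholder names for the per-sample denoising errors of the trainable policy and a frozen reference model on the preferred (`w`) and dispreferred (`l`) videos; `beta` is the usual DPO temperature. This is an illustrative sketch, not the paper's actual implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def diffusion_dpo_loss(err_theta_w, err_ref_w, err_theta_l, err_ref_l, beta=0.1):
    """Diffusion-DPO-style preference loss (scalar sketch; names are illustrative).

    Each err_* is a denoising MSE: theta = trainable policy, ref = frozen
    reference model; w = preferred (geometrically consistent) video,
    l = dispreferred video. The loss decreases when the policy lowers its
    error on the preferred sample relative to the dispreferred one, with
    both measured against the reference model's errors.
    """
    margin = (err_theta_w - err_ref_w) - (err_theta_l - err_ref_l)
    return -np.log(sigmoid(-beta * margin))
```

With equal errors everywhere the margin is zero and the loss sits at log 2; as the policy improves on the preferred sample (relative to the reference) the margin goes negative and the loss falls below that baseline. In VideoGPA's setting, the preference labels themselves would come from a geometry foundation model scoring 3D consistency rather than from human annotators.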