Recent advances in camera-controlled video diffusion models have significantly improved video-camera alignment, yet camera controllability remains limited. In this work, we build upon Reward Feedback Learning (ReFL) to further improve camera controllability. Directly borrowing existing ReFL approaches, however, faces several challenges. First, current reward models lack the capacity to assess video-camera alignment. Second, decoding latents into RGB videos for reward computation introduces substantial computational overhead. Third, 3D geometric information is typically neglected during video decoding. To address these limitations, we introduce an efficient camera-aware 3D decoder that decodes video latents into 3D representations for reward quantification. Specifically, the video latent and the camera pose are decoded into 3D Gaussians. In this process, the camera pose serves not only as an input but also as a projection parameter. Misalignment between the video latent and the camera pose causes geometric distortions in the 3D structure, resulting in blurry renderings. Based on this property, we use the pixel-level consistency between rendered novel views and ground-truth views as the reward and optimize it explicitly. To accommodate the stochastic nature of video generation, we further introduce a visibility term that restricts supervision to deterministic regions derived via geometric warping. Extensive experiments on the RealEstate10K and WorldScore benchmarks demonstrate the effectiveness of the proposed method. Project page: \href{https://a-bigbao.github.io/CamPilot/}{CamPilot Page}.
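As a rough illustration of the reward described above, the following sketch computes a visibility-masked pixel-level consistency score between rendered and ground-truth views. It is a minimal sketch under stated assumptions, not the implementation used in this work: the function name `masked_pixel_reward`, the use of mean-squared error, the array layout, and the binary mask are all hypothetical choices, and the 3D Gaussian decoding, rendering, and geometric warping steps are assumed to have already produced the inputs.

```python
import numpy as np


def masked_pixel_reward(rendered, target, visibility, eps=1e-8):
    """Negative masked mean-squared error (higher is better).

    Assumed (hypothetical) layout:
      rendered, target: (T, H, W, 3) arrays with values in [0, 1]
      visibility:       (T, H, W) array, 1 for deterministic (visible)
                        pixels and 0 for pixels excluded from supervision
    """
    rendered = np.asarray(rendered, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Add a channel axis so the mask broadcasts over RGB.
    mask = np.asarray(visibility, dtype=np.float64)[..., None]

    # Squared error, counted only where the mask is 1.
    sq_err = (rendered - target) ** 2 * mask

    # Normalize by the number of supervised values (pixels x channels).
    denom = mask.sum() * rendered.shape[-1] + eps
    return -sq_err.sum() / denom


# Toy usage: one 2x2 frame in which a single pixel is wrong.
target = np.zeros((1, 2, 2, 3))
rendered = target.copy()
rendered[0, 0, 0] = 1.0

full_mask = np.ones((1, 2, 2))
partial_mask = full_mask.copy()
partial_mask[0, 0, 0] = 0.0  # treat the wrong pixel as non-deterministic

reward_full = masked_pixel_reward(rendered, target, full_mask)
reward_partial = masked_pixel_reward(rendered, target, partial_mask)
```

In the toy example, the erroneous pixel lowers the reward when every pixel is supervised, whereas masking it out leaves the reward at essentially zero. This mirrors the intent of the visibility term: regions that cannot be determined from the input are not penalized.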