Recovering camera parameters from images and rendering scenes from novel viewpoints have long been treated as separate tasks in computer vision and graphics. This separation breaks down when image coverage is sparse or poses are ambiguous, since each task depends on what the other produces. We propose Rays as Pixels, a Video Diffusion Model (VDM) that learns a joint distribution over videos and camera trajectories. To our knowledge, this is the first model to both predict camera poses and perform camera-controlled video generation within a single framework. We represent each camera as dense ray pixels (raxels), a pixel-aligned encoding that lives in the same latent space as video frames, and denoise the two jointly through a Decoupled Self-Cross Attention mechanism. A single trained model handles three tasks: predicting camera trajectories from video, generating video from input images along a pre-defined trajectory, and jointly synthesizing video and trajectory from input images. We evaluate on pose estimation and camera-controlled video generation, and introduce a closed-loop self-consistency test showing that the model's predicted poses and its renderings conditioned on those poses agree. Ablations against Plücker embeddings confirm that representing cameras in a shared latent space with video is substantially more effective.
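To make the idea of a pixel-aligned camera encoding concrete, the following is a minimal sketch of a generic per-pixel ray map, together with the standard Plücker embedding used as the ablation baseline. It is not the raxel encoding itself, whose exact definition is not given here. The function names `pixel_rays` and `plucker_map`, the pinhole intrinsics `K`, and the camera-to-world pose convention are all assumptions made for illustration.

```python
import numpy as np


def pixel_rays(K, cam_to_world, height, width):
    """Per-pixel ray origins and unit directions in world space, each (H, W, 3).

    Assumes a pinhole camera with 3x3 intrinsics K and a 4x4 camera-to-world
    pose (hypothetical conventions, chosen only for this sketch).
    """
    # Pixel centers in image coordinates (u along columns, v along rows).
    u, v = np.meshgrid(np.arange(width) + 0.5, np.arange(height) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)  # (H, W, 3)

    # Back-project through the intrinsics to camera-space directions.
    dirs_cam = pix @ np.linalg.inv(K).T

    # Rotate into world space and normalize to unit length.
    R, t = cam_to_world[:3, :3], cam_to_world[:3, 3]
    dirs = dirs_cam @ R.T
    dirs = dirs / np.linalg.norm(dirs, axis=-1, keepdims=True)

    # Every ray of one camera shares the camera center as its origin.
    origins = np.broadcast_to(t, dirs.shape)
    return origins, dirs


def plucker_map(origins, dirs):
    """Standard 6-channel Plücker embedding (d, o x d) per pixel, shape (H, W, 6)."""
    moments = np.cross(origins, dirs)
    return np.concatenate([dirs, moments], axis=-1)


# Example: a 5x5 image, principal point at the image center, identity pose.
K = np.array([[5.0, 0.0, 2.5],
              [0.0, 5.0, 2.5],
              [0.0, 0.0, 1.0]])
pose = np.eye(4)
origins, dirs = pixel_rays(K, pose, height=5, width=5)
ray_map = plucker_map(origins, dirs)
```

Either array has the same spatial layout as an image, which is what allows a camera to be handled like an additional frame. How such a map is then encoded into the video latent space is specific to the method and is not sketched here.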