Stochastic video generation is particularly challenging when the camera is mounted on a moving platform, as camera motion interacts with observed image pixels, creating complex spatio-temporal dynamics and making the problem partially observable. Existing methods typically address this by focusing on raw pixel-level image reconstruction without explicitly modelling camera motion dynamics. We propose a solution that treats camera motion, or action, as part of the observed image state, modelling both image and action jointly within a multi-modal learning framework. We introduce three models: Video Generation with Learning Action Prior (VG-LeAP), which treats the image-action pair as an augmented state generated from a single latent stochastic process and uses variational inference to learn the image-action latent prior; Causal-LeAP, which establishes a causal relationship between the action and the observed image frame at time $t$, learning an action prior conditioned on the observed image states; and RAFI, which integrates the augmented image-action state concept into flow matching with diffusion generative processes, demonstrating that action-conditioned image generation can be extended to other diffusion-based models. Through detailed empirical studies on our new video-action dataset, RoAM, we emphasize the importance of multi-modal training for partially observable video generation problems.
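As an illustrative sketch of the augmented image-action state idea (not the paper's exact objectives), write $x_t$ for the image frame, $a_t$ for the platform action, and $z_t$ for the latent variable; these symbols are introduced here only for exposition. Under this reading, VG-LeAP maximizes a sequential variational lower bound over the joint pair $(x_t, a_t)$ with a learned latent prior, Causal-LeAP instead places the action prior downstream of the observed image, and RAFI applies a generic flow-matching loss to the augmented state:
\begin{align}
\mathcal{L}_{\text{VG-LeAP}} &\approx \sum_{t} \Big( \mathbb{E}_{q_\phi(z_t \mid x_{\le t}, a_{\le t})}\big[\log p_\theta(x_t, a_t \mid z_t, x_{<t}, a_{<t})\big] \nonumber \\
&\qquad\quad - \beta\, D_{\mathrm{KL}}\big(q_\phi(z_t \mid x_{\le t}, a_{\le t}) \,\|\, p_\psi(z_t \mid x_{<t}, a_{<t})\big) \Big), \\
p_{\text{Causal-LeAP}}(x_t, a_t \mid \cdot) &= p_\theta(x_t \mid z_t, x_{<t}, a_{<t})\; p_\omega(a_t \mid x_{\le t}), \\
\mathcal{L}_{\text{RAFI}} &\approx \mathbb{E}_{\tau, s_0, s_1}\big\| v_\theta(s_\tau, \tau) - (s_1 - s_0) \big\|^2, \quad s = (x, a),\; s_\tau = (1-\tau)\,s_0 + \tau\, s_1,
\end{align}
where $p_\psi$ denotes the learned image-action latent prior, $p_\omega$ the action prior conditioned on observed image states, and $v_\theta$ a velocity field over the augmented state; the exact conditioning and loss weights in the paper may differ from this sketch.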