Modern text-to-video synthesis models demonstrate coherent, photorealistic generation of complex videos from a text description. However, most existing models lack fine-grained control over camera movement, which is critical for downstream applications in content creation, visual effects, and 3D vision. Recently, new methods have demonstrated the ability to generate videos with controllable camera poses; these techniques leverage pre-trained U-Net-based diffusion models that explicitly disentangle spatial and temporal generation. Still, no existing approach enables camera control for the new, transformer-based video diffusion models that process spatial and temporal information jointly. Here, we propose to tame video transformers for 3D camera control using a ControlNet-like conditioning mechanism that incorporates spatiotemporal camera embeddings based on Plücker coordinates. Our approach demonstrates state-of-the-art performance for controllable video generation after fine-tuning on the RealEstate10K dataset. To the best of our knowledge, our work is the first to enable camera control for transformer-based video diffusion models.
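As background for the camera conditioning described above, the sketch below illustrates how per-pixel Plücker coordinates are commonly computed from a camera's pose: each pixel's ray through the camera center o with direction d is encoded as the 6-vector (d, o × d). This is a minimal, illustrative example assuming a pinhole intrinsics matrix K and world-to-camera extrinsics [R | t]; the function name and signature are hypothetical and not taken from the paper's implementation.

```python
import numpy as np

def plucker_embedding(K, R, t, H, W):
    """Per-pixel Plucker coordinates (d, o x d) for a single camera.

    K: (3, 3) pinhole intrinsics. R, t: world-to-camera rotation and
    translation, i.e. x_cam = R @ x_world + t. Returns an (H, W, 6) array
    that can serve as a spatiotemporal camera embedding for one frame.
    """
    # Camera center in world coordinates: solves R @ o + t = 0.
    o = -R.T @ t                                        # (3,)

    # Pixel grid at pixel centers, in homogeneous image coordinates.
    u, v = np.meshgrid(np.arange(W) + 0.5, np.arange(H) + 0.5)
    pix = np.stack([u, v, np.ones_like(u)], axis=-1)    # (H, W, 3)

    # Unproject pixels to camera-space ray directions, rotate to world space.
    d_cam = pix @ np.linalg.inv(K).T                    # rows are K^{-1} @ pix
    d_world = d_cam @ R                                 # rows are R^T @ d_cam
    d_world /= np.linalg.norm(d_world, axis=-1, keepdims=True)

    # Plucker coordinates: normalized direction and moment (o x d).
    moment = np.cross(np.broadcast_to(o, d_world.shape), d_world)
    return np.concatenate([d_world, moment], axis=-1)   # (H, W, 6)
```

Stacking such maps over all frames of a clip yields a dense, pose-dependent signal that a ControlNet-like branch can inject into the video transformer; the exact injection mechanism is described later in the paper.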