We propose 4Real-Video, a novel framework for generating 4D videos, organized as a grid of video frames with both time and viewpoint axes: each row contains frames sharing the same timestep, while each column contains frames from the same viewpoint. At the core of the framework is a novel two-stream architecture. One stream performs viewpoint updates on columns, and the other performs temporal updates on rows. After each diffusion transformer layer, a synchronization layer exchanges information between the two token streams; we present two implementations of this layer, using either hard or soft synchronization. The resulting feedforward architecture improves upon previous work in three ways: higher inference speed, enhanced visual quality (measured by FVD, CLIP, and VideoScore), and improved temporal and viewpoint consistency (measured by VideoScore and Dust3R-Confidence).
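The two-stream update with hard synchronization can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the tensor layout `(V, T, N, D)` (viewpoints, timesteps, tokens per frame, feature dim), the `mix` stand-in for attention, and averaging as the hard-synchronization rule are all assumptions made for the sketch.

```python
import numpy as np

V, T, N, D = 4, 6, 16, 32  # viewpoints, timesteps, tokens per frame, feature dim

def mix(x, axis):
    """Toy stand-in for an attention layer: residual mean-mixing along `axis`.
    Real attention would compute softmax-weighted sums; mixing along axis 0
    couples a column (same timestep, all viewpoints), and mixing along axis 1
    couples a row (same viewpoint, all timesteps)."""
    return x + x.mean(axis=axis, keepdims=True)

def two_stream_layer(view_tokens, time_tokens):
    # Viewpoint stream: update columns (frames sharing a timestep).
    view_tokens = mix(view_tokens, axis=0)
    # Temporal stream: update rows (frames sharing a viewpoint).
    time_tokens = mix(time_tokens, axis=1)
    # Hard synchronization (assumed here to be averaging): both streams
    # continue from a single fused token state after every layer.
    fused = 0.5 * (view_tokens + time_tokens)
    return fused, fused

rng = np.random.default_rng(0)
grid = rng.standard_normal((V, T, N, D))  # the 4D grid of frame tokens
v_stream, t_stream = grid.copy(), grid.copy()
for _ in range(3):  # a small stack of layers, with sync after each
    v_stream, t_stream = two_stream_layer(v_stream, t_stream)
```

Soft synchronization would instead keep the two streams distinct and only blend information between them (e.g. a learned weighted combination), rather than forcing them onto a single shared state as the hard variant above does.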