We introduce FLAG-4D, a framework for novel-view synthesis of dynamic scenes that reconstructs how 3D Gaussian primitives evolve through space and time. Existing methods typically rely on a single multilayer perceptron (MLP) to model temporal deformations, and they often struggle to capture complex point motions and fine-grained dynamic details consistently over time, especially from sparse input views. FLAG-4D overcomes this with a dual-deformation network that warps a canonical set of 3D Gaussians over time into new positions and anisotropic shapes. The network pairs an Instantaneous Deformation Network (IDN), which models fine-grained local deformations, with a Global Motion Network (GMN), which captures long-range dynamics; the two are refined through mutual learning. To keep these deformations both accurate and temporally smooth, FLAG-4D incorporates dense motion features from a pretrained optical-flow backbone: motion cues from adjacent timeframes are fused, and a deformation-guided attention mechanism aligns this flow information with the current state of each evolving 3D Gaussian. Extensive experiments demonstrate that FLAG-4D achieves higher-fidelity, more temporally coherent reconstructions with finer detail preservation than state-of-the-art methods.
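The core idea of the dual-deformation design can be illustrated with a minimal sketch: canonical Gaussian centers are displaced by the sum of a local (IDN-style) and a global (GMN-style) deformation, both conditioned on time. The tiny functions below are hypothetical stand-ins for the paper's networks (the abstract does not specify architectures), using fixed random weights instead of trained MLPs.

```python
import numpy as np

rng = np.random.default_rng(0)

# Canonical 3D Gaussians: N centers in R^3 (scales/rotations omitted for brevity).
N = 5
canonical_xyz = rng.normal(size=(N, 3))

# Hypothetical stand-in weights; a real system would learn MLP parameters.
W_idn = rng.normal(scale=0.1, size=(4, 3))
W_gmn = rng.normal(scale=0.1, size=(4, 3))

def idn(xyz, t, W):
    """Stand-in for the Instantaneous Deformation Network:
    predicts small per-point offsets for fine, local motion."""
    inp = np.concatenate([xyz, np.full((len(xyz), 1), t)], axis=1)  # (N, 4)
    return np.tanh(inp @ W)                                         # (N, 3)

def gmn(xyz, t, W):
    """Stand-in for the Global Motion Network:
    a smooth, low-frequency displacement field for long-range dynamics."""
    inp = np.concatenate([xyz, np.full((len(xyz), 1), t)], axis=1)
    return np.sin(inp @ W)

def deform(xyz, t):
    # Dual deformation: warp canonical centers by the sum of both branches.
    return xyz + idn(xyz, t, W_idn) + gmn(xyz, t, W_gmn)

warped = deform(canonical_xyz, t=0.5)
print(warped.shape)  # (5, 3)
```

In the full method the two branches are refined jointly via mutual learning and further conditioned on fused optical-flow features through deformation-guided attention; this sketch only shows the additive dual-branch warping of canonical Gaussians over time.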