We propose Mesh4D, a feed-forward model for monocular 4D mesh reconstruction. Given a monocular video of a dynamic object, our model reconstructs the object's complete 3D shape and motion, represented as a deformation field. Our key contribution is a compact latent space that encodes the entire animation sequence in a single pass. This latent space is learned by an autoencoder that, during training, is guided by the skeletal structure of the training objects, providing strong priors on plausible deformations. Crucially, skeletal information is not required at inference time. The encoder employs spatio-temporal attention, yielding a more stable representation of the object's overall deformation. Building on this representation, we train a latent diffusion model that, conditioned on the input video and the mesh reconstructed from the first frame, predicts the full animation in one shot. We evaluate Mesh4D on reconstruction and novel view synthesis benchmarks, outperforming prior methods in recovering accurate 3D shape and deformation.
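To make the two-stage design above concrete, here is a minimal PyTorch-style sketch of a spatio-temporal encoder that compresses an animation sequence into one latent, plus a denoiser conditioned on video and first-frame mesh features. All module names (SpatioTemporalEncoder, ConditionalDenoiser), tensor shapes, dimensions, the mean-pooling step, and the MLP denoiser are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SpatioTemporalEncoder(nn.Module):
    """Encodes a deformation sequence (T frames x N points x 3 offsets)
    into a single compact latent, alternating attention over space and time."""
    def __init__(self, dim=256, heads=8, latent_dim=512):
        super().__init__()
        self.embed = nn.Linear(3, dim)
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.to_latent = nn.Linear(dim, latent_dim)

    def forward(self, offsets):                    # offsets: (B, T, N, 3)
        B, T, N, _ = offsets.shape
        x = self.embed(offsets)                    # (B, T, N, dim)
        # Spatial attention: points attend to each other within a frame.
        xs = x.reshape(B * T, N, -1)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = xs.reshape(B, T, N, -1)
        # Temporal attention: each point attends across all frames.
        xt = x.permute(0, 2, 1, 3).reshape(B * N, T, -1)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = xt.reshape(B, N, T, -1)
        # Pool over space and time into one sequence-level latent.
        return self.to_latent(x.mean(dim=(1, 2)))  # (B, latent_dim)

class ConditionalDenoiser(nn.Module):
    """Latent diffusion denoiser conditioned on video features and the
    first-frame mesh latent; one pass covers the whole animation latent."""
    def __init__(self, latent_dim=512, cond_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + 2 * cond_dim + 1, 1024),
            nn.SiLU(),
            nn.Linear(1024, latent_dim),
        )

    def forward(self, noisy_latent, t, video_feat, mesh_feat):
        h = torch.cat([noisy_latent, video_feat, mesh_feat,
                       t[:, None].float()], dim=-1)
        return self.net(h)                         # predicted noise

# Toy usage: batch of 2 sequences, 8 frames, 1024 surface points each.
enc = SpatioTemporalEncoder()
z = enc(torch.randn(2, 8, 1024, 3))                # (2, 512)
eps = ConditionalDenoiser()(torch.randn_like(z), torch.tensor([10, 10]),
                            torch.randn(2, 512), torch.randn(2, 512))
print(z.shape, eps.shape)
```

In this sketch, skeletal guidance would enter only as a training loss on the autoencoder, which is consistent with the abstract's claim that no skeletal information is needed at inference time.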