Generating animated 3D objects is at the heart of many applications, yet most advanced methods remain difficult to apply in practice because of their restrictive setups, long runtimes, or limited quality. We introduce ActionMesh, a generative model that predicts production-ready 3D meshes "in action" in a feed-forward manner. Drawing inspiration from early video models, our key insight is to augment existing 3D diffusion models with a temporal axis, resulting in a framework we dub "temporal 3D diffusion". Specifically, we first adapt the 3D diffusion stage to generate a sequence of synchronized latents representing time-varying but independent 3D shapes. Second, we design a temporal 3D autoencoder that translates this sequence of independent shapes into the corresponding deformations of a predefined reference shape, from which we build an animation. Combining these two components, ActionMesh generates animated 3D meshes from a variety of inputs: a monocular video, a text description, or even a 3D mesh paired with a text prompt describing its animation. Moreover, compared to previous approaches, our method is fast and produces rig-free, topology-consistent results, enabling rapid iteration and seamless downstream applications such as texturing and retargeting. We evaluate our model on standard video-to-4D benchmarks (Consistent4D, Objaverse) and report state-of-the-art performance in both geometric accuracy and temporal consistency, demonstrating that our model delivers animated 3D meshes with unprecedented speed and quality.