We introduce Multi-view Ancestral Sampling (MAS), a method for 3D motion generation that uses 2D diffusion models trained on motions obtained from in-the-wild videos. As such, MAS opens up exciting and diverse motion domains that were previously under-explored because 3D data is scarce and hard to collect. MAS works by simultaneously denoising multiple 2D motion sequences, each representing a different view of the same 3D motion. It ensures consistency across all views at each diffusion step by combining the individual generations into a unified 3D sequence and projecting it back to the original views. We demonstrate MAS on 2D pose data acquired from videos depicting professional basketball maneuvers, rhythmic gymnastics performances featuring a ball apparatus, and horse races. 3D motion capture is arduous in each of these domains, yet MAS generates diverse and realistic 3D sequences. Unlike the Score Distillation approach, which optimizes each sample by repeatedly applying small fixes, our method uses a sampling process that was constructed for the diffusion framework. As we demonstrate, MAS avoids common issues such as out-of-domain sampling and mode collapse. https://guytevet.github.io/mas-page/
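The per-step consistency idea described above (denoise every view, merge into one 3D sequence, project back) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the abstract: the orthographic cameras placed around the vertical axis, the least-squares triangulation, the linear noise schedule, the independent per-view noise, and the `toy_denoiser` placeholder, which stands in for a trained 2D motion diffusion model. The helper names (`make_cameras`, `project`, `triangulate`, `sample`) are hypothetical.

```python
import numpy as np

def make_cameras(num_views):
    """Hypothetical orthographic cameras, evenly spaced around the vertical axis."""
    angles = np.linspace(0.0, 2.0 * np.pi, num_views, endpoint=False)
    cams = []
    for a in angles:
        c, s = np.cos(a), np.sin(a)
        rot = np.array([[c, 0.0, s], [0.0, 1.0, 0.0], [-s, 0.0, c]])
        cams.append(rot[:2])  # keep the two image-plane rows
    return np.stack(cams)  # shape (V, 2, 3)

def project(motion_3d, cams):
    """Project a (F, J, 3) motion into every view, giving (V, F, J, 2)."""
    return np.einsum("vij,fkj->vfki", cams, motion_3d)

def triangulate(views_2d, cams):
    """Least-squares 3D motion that best explains all (V, F, J, 2) views."""
    num_views, frames, joints, _ = views_2d.shape
    a = cams.reshape(2 * num_views, 3)
    b = views_2d.transpose(0, 3, 1, 2).reshape(2 * num_views, frames * joints)
    x, *_ = np.linalg.lstsq(a, b, rcond=None)  # shape (3, F*J)
    return x.T.reshape(frames, joints, 3)

def toy_denoiser(x_t, t):
    """Placeholder for a trained 2D diffusion model that predicts the clean sample."""
    return 0.9 * x_t

def sample(num_views=5, frames=8, joints=4, steps=10, seed=0):
    rng = np.random.default_rng(seed)
    cams = make_cameras(num_views)
    betas = np.linspace(1e-4, 0.2, steps)  # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bars = np.cumprod(alphas)
    x_t = rng.standard_normal((num_views, frames, joints, 2))
    for t in reversed(range(steps)):
        x0_views = toy_denoiser(x_t, t)           # denoise each view
        motion_3d = triangulate(x0_views, cams)   # merge into one 3D sequence
        x0_consistent = project(motion_3d, cams)  # project back to the views
        if t == 0:
            x_t = x0_consistent
            break
        # Standard DDPM posterior q(x_{t-1} | x_t, x_0) using the consistent x_0.
        ab_t, ab_prev = alpha_bars[t], alpha_bars[t - 1]
        mean = (np.sqrt(ab_prev) * betas[t] / (1.0 - ab_t)) * x0_consistent \
            + (np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - ab_t)) * x_t
        var = betas[t] * (1.0 - ab_prev) / (1.0 - ab_t)
        # Simplification: noise is drawn independently per view.
        x_t = mean + np.sqrt(var) * rng.standard_normal(x_t.shape)
    return motion_3d, x_t, cams
```

Calling `sample()` returns a 3D motion together with 2D views that agree with its projections. With the placeholder denoiser the result is not a meaningful motion; the sketch only shows where the merge-and-reproject step sits inside an ancestral sampling loop.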