We introduce Multi-view Ancestral Sampling (MAS), a method for generating consistent multi-view 2D samples of a motion sequence, enabling the creation of its 3D counterpart. MAS leverages a diffusion model trained solely on 2D data, opening up exciting and diverse motion domains that were previously under-explored because 3D data is scarce and hard to collect. MAS works by simultaneously denoising multiple 2D motion sequences that represent the same motion from different angles. Our consistency block enforces consistency across all views at each diffusion step by combining the individual generations into a unified 3D sequence and projecting it back to the original views for the next iteration. We demonstrate MAS on 2D pose data acquired from videos depicting professional basketball maneuvers, rhythmic gymnastics performances featuring a ball apparatus, and horse obstacle course races. In each of these domains, 3D motion capture is arduous, yet MAS generates diverse and realistic 3D sequences without textual conditioning. As we demonstrate, our ancestral sampling-based approach integrates more naturally with the diffusion framework than popular denoising optimization-based approaches, and avoids common issues such as out-of-domain sampling, lack of detail, and mode collapse. Project page: https://guytevet.github.io/mas-page/
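To make the consistency step concrete, the following is a minimal sketch and not the authors' implementation. It assumes orthographic cameras placed at known azimuth angles around the vertical axis, and it fuses the views with a plain linear least-squares fit; both are simplifying assumptions introduced here for illustration. The function names (`camera_matrix`, `project`, `triangulate`, `consistency_block`) are hypothetical. In a full sampler, a step like `consistency_block` would be applied to the per-view predictions at each diffusion step, before noise is added for the next iteration.

```python
import numpy as np

def camera_matrix(azimuth):
    """2x3 orthographic projection for a camera rotated by `azimuth`
    (radians) about the vertical y axis. Assumed camera model."""
    c, s = np.cos(azimuth), np.sin(azimuth)
    return np.array([[c, 0.0, -s],
                     [0.0, 1.0, 0.0]])

def project(points_3d, cams):
    """Project a 3D motion into every view.
    points_3d: (frames, joints, 3); cams: (views, 2, 3)
    returns:   (views, frames, joints, 2)"""
    return np.einsum("vij,fkj->vfki", cams, points_3d)

def triangulate(views_2d, cams):
    """Least-squares 3D motion whose projections best match all views.
    views_2d: (views, frames, joints, 2) -> (frames, joints, 3)"""
    V, F, J, _ = views_2d.shape
    A = cams.reshape(V * 2, 3)  # stacked projection rows, one pair per view
    # Reorder the 2D targets so their rows line up with the rows of A.
    b = views_2d.transpose(0, 3, 1, 2).reshape(V * 2, F * J)
    x, *_ = np.linalg.lstsq(A, b, rcond=None)  # x has shape (3, F * J)
    return x.T.reshape(F, J, 3)

def consistency_block(views_2d, cams):
    """Fuse per-view 2D predictions into one 3D sequence, then
    re-project it so that all views agree with a single 3D motion."""
    return project(triangulate(views_2d, cams), cams)

# Toy usage: four views of a random 5-frame, 17-joint "motion".
rng = np.random.default_rng(0)
cams = np.stack([camera_matrix(a) for a in (0.0, np.pi / 4, np.pi / 2, 3 * np.pi / 4)])
motion = rng.normal(size=(5, 17, 3))
views = project(motion, cams)
# Independent per-view perturbations stand in for inconsistent generations.
noisy_views = views + 0.1 * rng.normal(size=views.shape)
consistent_views = consistency_block(noisy_views, cams)
```

Because the fused output is by construction the projection of a single 3D sequence, applying `consistency_block` a second time leaves it unchanged, which is the property the abstract describes as consistency across all views.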