Traditional 3D content creation tools empower users to bring their imagination to life by giving them direct control over a scene's geometry, appearance, motion, and camera path. Creating computer-generated videos, however, is a tedious manual process that emerging text-to-video diffusion models can automate. Despite their great promise, video diffusion models are difficult to control, which hinders users from applying their own creativity rather than amplifying it. To address this challenge, we present a novel approach that combines the controllability of dynamic 3D meshes with the expressivity and editability of emerging diffusion models. Our approach takes an animated, low-fidelity rendered mesh as input and injects the ground-truth correspondence information obtained from the dynamic mesh into various stages of a pre-trained text-to-image generation model to output high-quality, temporally consistent frames. We demonstrate our approach on various examples in which motion is obtained by animating rigged assets or changing the camera path.