We present a technique for automatically deforming an input triangle mesh, guided solely by a text prompt. Our framework produces both large, low-frequency shape changes and small, high-frequency details. It relies on differentiable rendering to connect geometry to powerful pre-trained image encoders, such as CLIP and DINO. Updating mesh geometry by taking gradient steps through differentiable rendering is notoriously challenging and commonly results in deformed meshes with significant artifacts. These difficulties are amplified by noisy and inconsistent gradients from CLIP. To overcome these limitations, we represent the mesh deformation through Jacobians, which update the deformation in a global, smooth manner (rather than through locally suboptimal steps). Our key observation is that Jacobians are a representation that favors smoother, large deformations, leading to a global relation between vertices and pixels and avoiding localized noisy gradients. Additionally, to ensure the resulting shape is coherent from all 3D viewpoints, we encourage the deep features computed on the 2D encoding of the rendering to be consistent for a given vertex across all viewpoints. We demonstrate that our method smoothly deforms a wide variety of source meshes under a wide variety of target text prompts, achieving both large modifications (e.g., to the body proportions of animals) and fine semantic details, such as shoelaces on an army boot and fine details of a face.
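To make the Jacobian representation concrete, the following is a minimal sketch and not the implementation described here. It assumes a planar (2D) triangle mesh, one 2x2 Jacobian per triangle, and an area-weighted least-squares (Poisson-style) solve to recover vertex positions; the solver choice and the helper names `gradient_operator` and `solve_poisson` are hypothetical and introduced only for illustration. The point it shows is that a change to any single Jacobian moves all vertices through one global solve, rather than nudging a single vertex locally.

```python
import numpy as np

def gradient_operator(verts, faces):
    """Stack per-triangle operators G_t so that (G @ V)[2t:2t+2] equals J_t^T.

    Assumption: 2D rest mesh, verts is (n, 2), faces is (T, 3).
    """
    G = np.zeros((2 * len(faces), len(verts)))
    areas = np.zeros(len(faces))
    B = np.array([[-1.0, 1.0, 0.0], [-1.0, 0.0, 1.0]])  # edge differences
    for t, f in enumerate(faces):
        E = (B @ verts[f]).T  # columns are the rest edges p1-p0 and p2-p0
        G[2 * t:2 * t + 2, f] = np.linalg.inv(E.T) @ B
        areas[t] = 0.5 * abs(np.linalg.det(E))
    return G, areas

def solve_poisson(G, areas, jac_T, centroid):
    """Find vertices whose Jacobians best match jac_T in the least-squares sense."""
    w = np.sqrt(np.repeat(areas, 2))[:, None]  # area weight for each row of G
    V, *_ = np.linalg.lstsq(w * G, w * jac_T, rcond=None)
    # Jacobians are blind to translation, so fix it by pinning the centroid.
    return V - V.mean(axis=0) + centroid

# Unit square split into two triangles.
verts = np.array([[0.0, 0.0], [1.0, 0.0], [1.0, 1.0], [0.0, 1.0]])
faces = np.array([[0, 1, 2], [0, 2, 3]])

G, areas = gradient_operator(verts, faces)

# Target: every triangle is stretched by a factor of 2 along x.
J = np.array([[2.0, 0.0], [0.0, 1.0]])
jac_T = np.vstack([J.T for _ in faces])  # stacked transposed Jacobians

V = solve_poisson(G, areas, jac_T, verts.mean(axis=0))
print(V)
```

In a text-guided setting, the entries of `jac_T` would be the optimization variables, and gradients from the rendered images would flow back to them through the linear solve. If the requested Jacobians do not correspond to any valid mesh, the same solve returns the closest fit in the least-squares sense.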