Current controls over diffusion models (e.g., through text or ControlNet) for image generation fall short in recognizing abstract, continuous attributes like illumination direction or non-rigid shape change. In this paper, we present an approach that gives users of text-to-image models fine-grained control over several attributes in an image. We do this by engineering special sets of input tokens that can be transformed in a continuous manner -- we call them Continuous 3D Words. These attributes can, for example, be exposed as sliders and applied jointly with text prompts for fine-grained control over image generation. Given only a single mesh and a rendering engine, we show that our approach can be adopted to provide continuous user control over several 3D-aware attributes, including time-of-day illumination, bird wing orientation, dolly zoom effect, and object poses. Our method can condition image creation on multiple Continuous 3D Words and text descriptions simultaneously while adding no overhead to the generative process. Project Page: https://ttchengab.github.io/continuous_3d_words