This paper presents a novel method for exerting fine-grained lighting control during text-driven diffusion-based image generation. While existing diffusion models already have the ability to generate images under any lighting condition, without additional guidance these models tend to correlate image content and lighting. Moreover, text prompts lack the necessary expressive power to describe detailed lighting setups. To provide the content creator with fine-grained control over the lighting during image generation, we augment the text prompt with detailed lighting information in the form of radiance hints, i.e., visualizations of the scene geometry with a homogeneous canonical material under the target lighting. However, the scene geometry needed to produce the radiance hints is unknown. Our key observation is that we only need to guide the diffusion process, hence exact radiance hints are not necessary; we only need to point the diffusion model in the right direction. Based on this observation, we introduce a three-stage method for controlling the lighting during image generation. In the first stage, we leverage a standard pretrained diffusion model to generate a provisional image under uncontrolled lighting. Next, in the second stage, we resynthesize and refine the foreground object in the generated image by passing the target lighting to a refined diffusion model, named DiLightNet, using radiance hints computed on a coarse shape of the foreground object inferred from the provisional image. To retain the texture details, we multiply the radiance hints with a neural encoding of the provisional image before passing them to DiLightNet. Finally, in the third stage, we resynthesize the background to be consistent with the lighting on the foreground object. We demonstrate and validate our lighting-controlled diffusion model on a variety of text prompts and lighting conditions.
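For readers who prefer a procedural view, the following minimal Python sketch summarizes the three-stage pipeline described above. Every function in it is a hypothetical stub standing in for a real component (a pretrained text-to-image diffusion model, a coarse shape estimator, a radiance-hint renderer, DiLightNet itself, and a background resynthesis step); the names, signatures, and resolutions are assumptions for illustration, not the authors' actual API.

```python
# Illustrative sketch of the three-stage lighting-controlled generation pipeline.
# All functions are hypothetical stubs; only the data flow mirrors the abstract.
import numpy as np

H = W = 512  # assumed working resolution


def generate_provisional(prompt: str) -> np.ndarray:
    """Stage 1 stub: text-to-image generation under uncontrolled lighting."""
    return np.random.rand(H, W, 3)


def infer_coarse_shape(image: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Stub: coarse foreground geometry and a foreground mask from the image."""
    return np.random.rand(H, W, 3), np.ones((H, W, 1))


def render_radiance_hints(shape: np.ndarray, lighting: np.ndarray) -> np.ndarray:
    """Stub: visualize the coarse shape with a homogeneous canonical material
    under the target lighting to obtain radiance hints."""
    return np.random.rand(H, W, 3)


def encode(image: np.ndarray) -> np.ndarray:
    """Stub: neural encoding of the provisional image that carries texture detail."""
    return np.random.rand(H, W, 3)


def dilightnet(prompt: str, conditioning: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Stub: refined diffusion model conditioned on the hint-modulated encoding."""
    return conditioning * mask


def resynthesize_background(fg: np.ndarray, mask: np.ndarray, prompt: str) -> np.ndarray:
    """Stage 3 stub: background consistent with the foreground lighting."""
    return fg


def lighting_controlled_generation(prompt: str, target_lighting: np.ndarray) -> np.ndarray:
    provisional = generate_provisional(prompt)                 # Stage 1
    shape, mask = infer_coarse_shape(provisional)              # Stage 2
    hints = render_radiance_hints(shape, target_lighting)
    conditioning = encode(provisional) * hints                 # retain texture detail
    foreground = dilightnet(prompt, conditioning, mask)
    return resynthesize_background(foreground, mask, prompt)   # Stage 3


result = lighting_controlled_generation("a toy robot on a desk",
                                         target_lighting=np.random.rand(16, 32, 3))
print(result.shape)
```

The sketch only makes the stage boundaries and the hint-times-encoding conditioning explicit; in practice each stub would be replaced by the corresponding learned or rendering component.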