Diffusion models achieve high-quality results in conditional text-to-image generation, particularly when guided by structural cues such as edges, layouts, and depth. Lighting, however, has received limited attention and remains difficult to control within the generative process. Existing methods handle lighting through an inefficient two-stage pipeline that relights images after generation. They also rely on fine-tuning with large datasets and heavy computation, which limits their adaptability to new models and tasks. To address these limitations, we propose a novel Training-Free Light-Guided Text-to-Image Diffusion Model via Initial Noise Manipulation (LGTM), which manipulates the initial latent noise of the diffusion process so that image generation is guided by both the text prompt and a user-specified light direction. Through a channel-wise analysis of the latent space, we find that selectively manipulating latent channels enables fine-grained lighting control without fine-tuning or modifying the pre-trained model. Extensive experiments show that our method surpasses prompt-based baselines in lighting consistency while preserving image quality and text alignment. This approach opens new possibilities for dynamic, user-guided light control. It also integrates seamlessly with models such as ControlNet, demonstrating adaptability across diverse scenarios.
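To make the idea of channel-wise initial-noise manipulation concrete, the following is a minimal sketch and not the LGTM procedure itself. The abstract does not specify how the latent channels are modified, so the linear brightness ramp used here is purely an illustrative assumption, as are the function name `light_guided_noise` and the parameters `target_channels` and `strength`. The sketch draws a standard Gaussian latent of shape channels x height x width and, on the selected channels only, adds a bias that is largest on the side facing the light.

```python
import math
import random


def light_guided_noise(channels, height, width, direction, target_channels,
                       strength=0.5, seed=0):
    """Return a Gaussian latent [C][H][W] with a directional bias on some channels.

    Hypothetical illustration: `direction` is an (dx, dy) vector pointing toward
    the light source in image coordinates (x to the right, y downward). The
    linear ramp is an assumed stand-in for the manipulation, not the paper's.
    """
    rng = random.Random(seed)
    dx, dy = direction
    norm = math.hypot(dx, dy)
    if norm == 0:
        raise ValueError("direction must be non-zero")
    dx, dy = dx / norm, dy / norm

    latent = []
    for c in range(channels):
        plane = []
        for y in range(height):
            row = []
            for x in range(width):
                # Every entry starts as ordinary N(0, 1) noise.
                value = rng.gauss(0.0, 1.0)
                if c in target_channels:
                    # Normalised pixel coordinates in [-1, 1].
                    u = 2 * x / (width - 1) - 1 if width > 1 else 0.0
                    v = 2 * y / (height - 1) - 1 if height > 1 else 0.0
                    # Positive offset on the side facing the light.
                    value += strength * (u * dx + v * dy)
                row.append(value)
            plane.append(row)
        latent.append(plane)
    return latent


# Light coming from the left, applied to channel 0 of a 4 x 8 x 8 latent.
latent = light_guided_noise(4, 8, 8, direction=(-1, 0), target_channels={0})
```

In a real pipeline the resulting latent would be converted to a tensor and passed to the pre-trained model as its starting noise. Note that adding a deterministic ramp shifts the noise statistics away from a standard Gaussian, so `strength` would need to stay small; which channels to target is exactly the question the channel-wise analysis in the paper is meant to answer.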