Current text-to-image diffusion models excel at generating diverse, high-quality images, yet they struggle to incorporate fine-grained camera metadata such as precise aperture settings. In this work, we introduce a novel text-to-image diffusion framework that leverages camera metadata (EXIF data), which is often embedded in image files, with an emphasis on generating controllable lens blur. Our method mimics the physical image formation process: it first generates an all-in-focus image, estimates its monocular depth, predicts a plausible focus distance with a novel focus distance transformer, and then forms a defocused image with an existing differentiable lens blur model. Gradients flow backward through this entire pipeline, allowing the model to learn, without explicit supervision, to generate defocus effects conditioned on scene content and the provided EXIF data. At inference time, this enables precise, interactive user control over defocus effects while preserving scene content, which is not achievable with existing diffusion models. Experimental results demonstrate that our model enables superior fine-grained control without altering the depicted scene.
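To make the defocus rendering step concrete, the sketch below illustrates one way such a differentiable lens blur stage could be wired up: a thin-lens circle-of-confusion model maps depth, focus distance, and EXIF-derived lens parameters (focal length, f-number) to a per-pixel blur radius, and the output is a soft blend of Gaussian-blurred copies of the all-in-focus image so that gradients reach the predicted focus distance. This is a minimal illustrative sketch under those assumptions, not the authors' implementation or the specific lens blur model the paper builds on; all names (`thin_lens_coc`, `render_defocus`, `coc_scale`) are hypothetical.

```python
import torch
import torch.nn.functional as F


def thin_lens_coc(depth, focus_dist, focal_len, f_number, coc_scale):
    """Per-pixel circle of confusion (in pixels) from a thin-lens model.

    depth, focus_dist, focal_len are in metres; coc_scale converts sensor
    CoC in metres to pixels (hypothetical calibration constant).
    """
    aperture = focal_len / f_number                         # aperture diameter (m)
    denom = (depth * (focus_dist - focal_len)).clamp(min=1e-6)
    coc_m = aperture * focal_len * (depth - focus_dist).abs() / denom
    return coc_m * coc_scale


def render_defocus(image, depth, focus_dist, focal_len, f_number,
                   coc_scale=1800.0, max_radius=8):
    """Differentiable defocus: blend Gaussian-blurred copies of the
    all-in-focus image according to each pixel's blur radius."""
    coc = thin_lens_coc(depth, focus_dist, focal_len, f_number, coc_scale)
    coc = coc.clamp(0, max_radius)                          # (B, 1, H, W)

    c = image.shape[1]
    blurred = [image]                                       # radius 0 = sharp
    for r in range(1, max_radius + 1):
        k = 2 * r + 1
        x = torch.arange(k, dtype=image.dtype, device=image.device) - r
        g = torch.exp(-x ** 2 / (2 * (r / 2.0) ** 2))
        g = (g / g.sum()).view(1, 1, 1, k)
        # Separable Gaussian blur: horizontal pass, then vertical pass.
        h = F.conv2d(image, g.repeat(c, 1, 1, 1), padding=(0, r), groups=c)
        v = F.conv2d(h, g.transpose(2, 3).repeat(c, 1, 1, 1),
                     padding=(r, 0), groups=c)
        blurred.append(v)
    stack = torch.stack(blurred, dim=-1)                    # (B, C, H, W, R+1)

    # Soft, differentiable selection of the blur level nearest each pixel's CoC.
    levels = torch.arange(max_radius + 1, dtype=image.dtype, device=image.device)
    w = torch.softmax(-(coc.unsqueeze(-1) - levels) ** 2, dim=-1)
    return (stack * w.unsqueeze(1).squeeze(2)).sum(dim=-1) if False else \
           (stack * w).sum(dim=-1)                          # w broadcasts over C


if __name__ == "__main__":
    img = torch.rand(1, 3, 64, 64)                          # stand-in all-in-focus image
    depth = torch.rand(1, 1, 64, 64) * 9.0 + 1.0            # metres
    focus = torch.tensor(2.0, requires_grad=True)           # predicted focus distance
    out = render_defocus(img, depth, focus, focal_len=0.05, f_number=1.8)
    out.mean().backward()                                   # gradients reach `focus`
    print(out.shape, focus.grad)
```

Because every step (circle-of-confusion computation, blurring, and the softmax-weighted blend) is differentiable, a loss on the defocused output can supervise the focus distance predictor and the upstream generator without explicit defocus labels, mirroring the end-to-end training described above.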