Image generation today can produce reasonably realistic images from text prompts. However, if one asks the generator to synthesize a particular camera setting, such as the different fields of view of a 24mm versus a 70mm lens, the generator cannot interpret the request and produce scene-consistent images. This limitation not only hinders the adoption of generative tools in photography applications but also exemplifies a broader issue: bridging the gap between data-driven models and the physical world. In this paper, we introduce Generative Photography, a framework designed to control camera intrinsic settings during content generation. The core innovations of this work are the concepts of Dimensionality Lifting and Contrastive Camera Learning, which achieve continuous and consistent transitions across different camera settings. Experimental results show that our method produces significantly more scene-consistent, photorealistic images than state-of-the-art models such as Stable Diffusion 3 and FLUX.
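For context, the field-of-view difference between the 24mm and 70mm lenses mentioned above follows directly from rectilinear lens geometry. A minimal sketch, assuming a full-frame 36mm-wide sensor (the function name and defaults are illustrative, not part of the paper's method):

```python
import math

def horizontal_fov_deg(focal_length_mm, sensor_width_mm=36.0):
    """Horizontal field of view of a rectilinear lens on a sensor of the given width."""
    return math.degrees(2 * math.atan(sensor_width_mm / (2 * focal_length_mm)))

# A 24mm lens covers roughly 73.7 degrees horizontally; a 70mm lens only about 28.8.
wide = horizontal_fov_deg(24.0)
tele = horizontal_fov_deg(70.0)
```

This roughly 2.5x change in angular coverage is the kind of physically grounded, continuous control that the proposed framework aims to expose to the generator.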