Despite recent advances in text-to-image generation, controlling geometric layout and PBR material properties in synthesized scenes remains challenging. We present a pipeline that first produces a G-buffer (albedo, normals, depth, roughness, shading, and metallic) from a text prompt and then renders a final image through a PBR-inspired branch network. This intermediate representation enables fine-grained control: users can copy and paste within specific G-buffer channels to insert or reposition objects, or apply masks to the shading (irradiance) channel to adjust lighting locally. As a result, real objects can be seamlessly integrated into virtual scenes. By separating user-friendly scene description from image rendering, our method offers a practical balance between detailed post-generation control and efficient text-driven synthesis. We demonstrate its effectiveness through quantitative evaluations and a user study with 156 participants, showing consistent human preference over strong baselines and confirming that G-buffer control extends the flexibility of text-guided image generation.
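The per-channel editing operations described above can be sketched as array manipulations. This is a minimal illustration only, not the paper's implementation: it assumes each G-buffer channel is stored as an H×W×C NumPy array, and the names `copy_paste` and `relight` are hypothetical helpers introduced here for clarity.

```python
import numpy as np

# Hypothetical G-buffer layout: one float array per channel (names from the abstract).
H, W = 64, 64
rng = np.random.default_rng(0)
gbuffer = {
    "albedo":     rng.random((H, W, 3), dtype=np.float32),
    "normals":    rng.random((H, W, 3), dtype=np.float32),
    "depth":      rng.random((H, W, 1), dtype=np.float32),
    "roughness":  rng.random((H, W, 1), dtype=np.float32),
    "irradiance": rng.random((H, W, 3), dtype=np.float32),
    "metallic":   rng.random((H, W, 1), dtype=np.float32),
}

def copy_paste(gbuf, channels, src, dst, size):
    """Copy a size x size patch from src=(y, x) to dst=(y, x) in the named channels."""
    sy, sx = src
    dy, dx = dst
    for name in channels:
        patch = gbuf[name][sy:sy + size, sx:sx + size].copy()
        gbuf[name][dy:dy + size, dx:dx + size] = patch

def relight(gbuf, mask, gain):
    """Scale the irradiance channel inside a binary mask to adjust lighting locally."""
    m = mask[..., None].astype(np.float32)  # broadcast mask over color channels
    gbuf["irradiance"] = gbuf["irradiance"] * (1.0 + (gain - 1.0) * m)

# Reposition an object: move its geometry and material patches together,
# so the renderer sees a consistent edit across all affected channels.
copy_paste(gbuffer, ["albedo", "normals", "depth", "roughness", "metallic"],
           src=(0, 0), dst=(32, 32), size=16)

# Brighten a region: mask the irradiance channel and apply a 1.5x gain.
mask = np.zeros((H, W), dtype=bool)
mask[10:30, 10:30] = True
relight(gbuffer, mask, gain=1.5)
```

In this reading, geometric edits touch the geometry/material channels while lighting edits touch only irradiance, which is what lets real objects be composited into a virtual scene before the final rendering pass.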