Modeling and producing lifelike clothed human images has attracted researchers' attention from different areas for decades, owing to the complexity of highly articulated and structured content. Rendering algorithms decompose and simulate the imaging process of a camera, but are limited by the accuracy of the modeled variables and the efficiency of computation. Generative models can produce impressively vivid human images, yet still lack controllability and editability. This paper studies photorealism enhancement of rendered images, leveraging the generative power of diffusion models on the controlled basis of rendering. We introduce a novel framework that translates rendered images into their realistic counterparts, consisting of two stages: Domain Knowledge Injection (DKI) and Realistic Image Generation (RIG). In DKI, we adopt positive (real) domain finetuning and negative (rendered) domain embedding to inject knowledge into a pretrained text-to-image (T2I) diffusion model. In RIG, we generate the realistic image corresponding to the input rendered image with a Texture-preserving Attention Control (TAC), which preserves fine-grained clothing textures by exploiting the decoupled features encoded in the UNet structure. Additionally, we introduce the SynFashion dataset, featuring high-quality digital clothing images with diverse textures. Extensive experimental results demonstrate the superiority and effectiveness of our method in rendered-to-real image translation.