Pose-Guided Person Image Synthesis (PGPIS) aims to synthesize high-quality person images corresponding to target poses while preserving the appearance of the source image. Recently, PGPIS methods based on diffusion models have achieved competitive performance. Most approaches extract representations of the target pose and source image and learn their relationships during the generative model's training. This makes it difficult to learn the semantic relationships between the input and target images and requires increasingly complex model structures to improve generation quality. To address these issues, we propose Fusion embedding for PGPIS using a Diffusion Model (FPDM). Inspired by the successful application of pre-trained CLIP models in text-to-image diffusion models, our method consists of two stages. The first stage trains a fusion embedding of the source image and target pose to align with the target image's embedding. In the second stage, the generative model uses this fusion embedding as a condition to generate the target image. We applied the proposed method to the benchmark datasets DeepFashion and RWTH-PHOENIX-Weather 2014T, and conducted both quantitative and qualitative evaluations, demonstrating state-of-the-art (SOTA) performance. An ablation study of the model structure showed that even a model using only the second stage achieved performance close to other PGPIS SOTA models. The code is available at https://github.com/dhlee-work/FPDM.