Text-to-image (T2I) models such as Stable Diffusion have been used to generate high-quality images of people. However, because the generation process is stochastic, the generated person's appearance (e.g., pose, face, and clothing) varies from sample to sample even when the same text prompt is used. This inconsistency in appearance makes T2I models unsuitable for pose transfer. We address this by proposing a multimodal diffusion model that accepts text, pose, and visual prompts. Our model is the first unified method to perform all person image tasks: generation, pose transfer, and mask-less editing. We also pioneer the direct use of low-dimensional 3D body model parameters to demonstrate a new capability: simultaneous interpolation of pose and camera view while maintaining the person's appearance.