Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeletal human poses, especially under complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to achieve accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representations during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model's precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets covering a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 on the LAION-Human dataset, marking around a 13% improvement over the established technique ControlNet. The project and code are available at https://github.com/ai-med/StablePose.
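To make the coarse-to-fine masked-attention idea concrete, the sketch below illustrates one plausible form of it: self-attention logits for keys outside a pose mask are suppressed, and a hierarchy of progressively less-dilated masks moves from coarse to fine. All function names, the identity Q/K/V projections, the additive masking scheme, and the 1-D stand-in for spatial dilation are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def masked_self_attention(x, pose_mask):
    """One self-attention pass in which keys outside the pose region are
    suppressed via a large negative logit bias (an assumed mechanism).
    x: (n_tokens, d) token features; pose_mask: (n_tokens,) bool array."""
    n, d = x.shape
    # Toy projections: identity Q/K/V, kept minimal for illustration.
    q, k, v = x, x, x
    logits = (q @ k.T) / np.sqrt(d)
    logits[:, ~pose_mask] = -1e9                  # mask out non-pose keys
    attn = np.exp(logits - logits.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ v

def coarse_to_fine_masks(pose_mask, n_levels=3):
    """Hierarchy of masks: widest (most dilated) first, exact mask last.
    1-D neighbor dilation stands in for 2-D dilation of a skeleton map."""
    masks = []
    for level in reversed(range(n_levels)):
        dilated = pose_mask.copy()
        for _ in range(level):                    # more dilation = coarser
            dilated = dilated | np.roll(dilated, 1) | np.roll(dilated, -1)
        masks.append(dilated)
    return masks
```

Applied per ViT stage, early layers would attend within the dilated (coarse) mask and later layers within the tight (fine) mask, so pose-relevant tokens are refined gradually rather than constrained all at once.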