Stable-Pose: Leveraging Transformers for Pose-Guided Text-to-Image Generation

Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeleton human poses, especially in complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to provide accurate pose guidance for T2I models. Stable-Pose is designed to handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representation during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to place increased emphasis on the pose region, thereby improving the model's precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets under a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 on the LAION-Human dataset, an improvement of around 13% over the established technique ControlNet. The project link and code are available at https://github.com/ai-med/StablePose.
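To make the three ingredients named above more concrete (coarse-to-fine pose masks, masked self-attention, and a pose-weighted loss), the following is a minimal pure-Python sketch. It is an illustration under stated assumptions and not the Stable-Pose implementation: the function names, the dilation radii, the choice to mask out keys that lie outside the pose region, the rule that a token may always attend to itself, and the weighting form `1 + alpha * mask` are all hypothetical choices made for this example.

```python
import math

def dilate(mask, radius):
    """Binary dilation of a 2D 0/1 grid with a square window of the given radius."""
    h, w = len(mask), len(mask[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if mask[y][x]:
                for yy in range(max(0, y - radius), min(h, y + radius + 1)):
                    for xx in range(max(0, x - radius), min(w, x + radius + 1)):
                        out[yy][xx] = 1
    return out

def coarse_to_fine_masks(pose_mask, radii=(2, 1, 0)):
    """One mask per attention stage, from most dilated (coarse) to least (fine).
    The radii are illustrative values, not taken from the paper."""
    return [dilate(pose_mask, r) for r in radii]

def masked_self_attention(scores, token_mask):
    """Row-wise softmax over keys, with keys outside the pose region masked out.
    Assumption: a token may always attend to itself, so no row is empty."""
    n = len(scores)
    weights = []
    for i in range(n):
        row = [scores[i][j] if (token_mask[j] or i == j) else -math.inf
               for j in range(n)]
        peak = max(row)  # finite, because the diagonal entry is never masked
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

def pose_weighted_mse(pred, target, pose_mask, alpha=5.0):
    """Mean squared error in which pose pixels get weight 1 + alpha (assumed form)."""
    total, count = 0.0, 0
    for p_row, t_row, m_row in zip(pred, target, pose_mask):
        for p, t, m in zip(p_row, t_row, m_row):
            total += (1.0 + alpha * m) * (p - t) ** 2
            count += 1
    return total / count
```

In this sketch, each successive mask from `coarse_to_fine_masks` covers a smaller neighbourhood around the skeleton, so later attention stages are restricted to a tighter pose region, while `pose_weighted_mse` penalises errors inside that region more heavily than errors in the background.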
