Controllable text-to-image (T2I) diffusion models have shown impressive performance in generating high-quality visual content through the incorporation of various conditions. Current methods, however, exhibit limited performance when guided by skeletal human poses, especially under complex pose conditions such as side or rear perspectives of human figures. To address this issue, we present Stable-Pose, a novel adapter model that introduces a coarse-to-fine attention masking strategy into a vision Transformer (ViT) to achieve accurate pose guidance for T2I models. Stable-Pose is designed to adeptly handle pose conditions within pre-trained Stable Diffusion, providing a refined and efficient way of aligning pose representations during image synthesis. We leverage the query-key self-attention mechanism of ViTs to explore the interconnections among different anatomical parts in human pose skeletons. Masked pose images are used to smoothly refine the attention maps based on target pose-related features in a hierarchical manner, transitioning from coarse to fine levels. Additionally, our loss function is formulated to allocate increased emphasis to the pose region, thereby augmenting the model's precision in capturing intricate pose details. We assessed the performance of Stable-Pose across five public datasets covering a wide range of indoor and outdoor human pose scenarios. Stable-Pose achieved an AP score of 57.1 on the LAION-Human dataset, marking around a 13% improvement over the established technique ControlNet. The project and code are available at https://github.com/ai-med/StablePose.
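To make the coarse-to-fine masked-attention idea concrete, the sketch below illustrates one plausible form of it: self-attention logits for keys outside a pose mask are suppressed, and a hierarchy of progressively less-dilated masks moves from coarse to fine. All function names, the identity Q/K/V projections, the additive masking scheme, and the 1-D stand-in for spatial dilation are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def masked_self_attention(x, pose_mask):
    """One self-attention pass in which keys outside the pose region are
    suppressed via a large negative logit bias (an assumed mechanism).
    x: (n_tokens, d) token features; pose_mask: (n_tokens,) bool array."""
    n, d = x.shape
    # Toy projections: identity Q/K/V, kept minimal for illustration.
    q, k, v = x, x, x
    logits = (q @ k.T) / np.sqrt(d)
    logits[:, ~pose_mask] = -1e9                  # mask out non-pose keys
    attn = np.exp(logits - logits.max(-1, keepdims=True))
    attn /= attn.sum(-1, keepdims=True)
    return attn @ v

def coarse_to_fine_masks(pose_mask, n_levels=3):
    """Hierarchy of masks: widest (most dilated) first, exact mask last.
    1-D neighbor dilation stands in for 2-D dilation of a skeleton map."""
    masks = []
    for level in reversed(range(n_levels)):
        dilated = pose_mask.copy()
        for _ in range(level):                    # more dilation = coarser
            dilated = dilated | np.roll(dilated, 1) | np.roll(dilated, -1)
        masks.append(dilated)
    return masks
```

Applied per ViT stage, early layers would attend within the dilated (coarse) mask and later layers within the tight (fine) mask, so pose-relevant tokens are refined gradually rather than constrained all at once.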