We present a methodology for conditional control of human shape and pose in pretrained text-to-image diffusion models using a 3D human parametric model (SMPL). Fine-tuning these diffusion models to adhere to new conditions requires large datasets with high-quality annotations, which can be acquired more cost-effectively through synthetic data generation than through real-world collection. However, the domain gap and low scene diversity of synthetic data can compromise the pretrained model's visual fidelity. We propose a domain-adaptation technique that maintains image quality by isolating the synthetically trained conditional information in the classifier-free guidance vector and composing it with another control network that adapts the generated images to the input domain. To achieve SMPL control, we fine-tune a ControlNet-based architecture on the synthetic SURREAL dataset of rendered humans and apply our domain adaptation at generation time. Experiments demonstrate that our model achieves greater shape and pose diversity than the 2D pose-based ControlNet while maintaining visual fidelity and improving stability, demonstrating its usefulness for downstream tasks such as human animation.
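The guidance composition described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the three noise predictions, and the shared guidance scale are assumptions made for clarity. The idea is that the synthetically trained SMPL condition contributes its own direction to the classifier-free guidance vector, which is then summed with the direction from a second, domain-adapting control network instead of conditioning on both jointly.

```python
import numpy as np

def composed_cfg(eps_uncond, eps_smpl, eps_domain, scale=7.5):
    """Hypothetical sketch of composed classifier-free guidance.

    eps_uncond : noise prediction with no conditioning
    eps_smpl   : prediction conditioned on the synthetically trained SMPL control
    eps_domain : prediction conditioned on the domain-adapting control network

    Each condition's guidance direction is isolated as a difference from the
    unconditional prediction, then the directions are composed by summation.
    """
    g_smpl = eps_smpl - eps_uncond      # SMPL shape/pose guidance direction
    g_domain = eps_domain - eps_uncond  # domain-adaptation guidance direction
    return eps_uncond + scale * (g_smpl + g_domain)
```

In a real sampler these arrays would be the denoiser's noise predictions at each diffusion step; separate per-condition scales could also be used to weight pose control against domain adaptation.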