Sketch2Human: Deep Human Generation with Disentangled Geometry and Appearance Control

Geometry- and appearance-controlled full-body human image generation is an interesting but challenging task. Existing solutions are either unconditional or dependent on coarse conditions (e.g., pose, text), and thus lack explicit control over the geometry and appearance of the body and garments. Sketching offers such editing ability and has been adopted in various sketch-based face generation and editing solutions. However, directly adapting sketch-based face generation to full-body generation often fails to produce high-fidelity and diverse results because of the high complexity and diversity of pose, body shape, and garment shape and texture. Recent geometrically controllable diffusion-based methods mainly rely on prompts to generate appearance, and they struggle to balance realism against faithfulness to the sketch when the input is coarse. This work presents Sketch2Human, the first system for controllable full-body human image generation guided by a semantic sketch (for geometry control) and a reference image (for appearance control). Our solution is based on the latent space of StyleGAN-Human and takes inverted geometry and appearance latent codes as input. Specifically, we present a sketch encoder trained on a large synthetic dataset sampled from StyleGAN-Human's latent space and directly supervised by sketches rather than real images. Because StyleGAN-Human partially entangles geometry and texture information and no disentangled datasets are available, we design a novel training scheme that creates geometry-preserved and appearance-transferred training data, which we use to tune a generator for disentangled geometry and appearance control. Although our method is trained with synthetic data, it can handle hand-drawn sketches as well. Qualitative and quantitative evaluations demonstrate that our method outperforms state-of-the-art methods.
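To make the idea of separate geometry and appearance latent codes more concrete, the following is a minimal sketch of plain layer-wise style mixing, a common StyleGAN baseline. It is not the Sketch2Human training scheme: the abstract itself notes that StyleGAN-Human partially entangles geometry and texture, which is why the authors tune the generator rather than rely on mixing alone. The layer count, latent dimension, crossover index, and the names `random_wplus` and `mix_latents` are all hypothetical choices made for illustration, and random vectors stand in for the codes that a sketch encoder and an image inversion step would produce.

```python
import random

# All constants below are illustrative assumptions, not values from the paper.
NUM_LAYERS = 18   # assumed number of per-layer style vectors in a W+ code
LATENT_DIM = 8    # kept tiny for readability; StyleGAN-style codes are much larger
CROSSOVER = 8     # hypothetical split: layers below it are taken as "geometry"


def random_wplus(seed):
    """Return a stand-in W+ code: one random vector per generator layer."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(LATENT_DIM)]
            for _ in range(NUM_LAYERS)]


def mix_latents(w_geometry, w_appearance, crossover=CROSSOVER):
    """Layer-wise style mixing of two W+ codes.

    Layers [0, crossover) come from the geometry code and layers
    [crossover, end) come from the appearance code. Rows are copied so the
    result does not alias either input.
    """
    if len(w_geometry) != len(w_appearance):
        raise ValueError("both codes must have the same number of layers")
    if not 0 <= crossover <= len(w_geometry):
        raise ValueError("crossover must lie within the layer range")
    coarse = [list(row) for row in w_geometry[:crossover]]
    fine = [list(row) for row in w_appearance[crossover:]]
    return coarse + fine


# Stand-ins for a code inverted from a sketch and a code inverted from a
# reference image (both hypothetical here).
w_geo = random_wplus(seed=0)
w_app = random_wplus(seed=1)
w_mixed = mix_latents(w_geo, w_app)
```

In a real pipeline the mixed code would be passed to the generator's synthesis network. Where coarse layers leak texture or fine layers leak shape, this baseline is expected to show exactly the entanglement that the abstract's tuned generator is meant to address.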
