Current subject-driven image generation methods encounter significant challenges in person-centric image generation because they learn semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Specifically, to generate realistic persons, they need to tune the pre-trained model sufficiently, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation overfit the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to a quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline that eliminates the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., a Text-driven Diffusion Model (TDM) and a Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages: semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM, respectively. In the subject-scene fusion stage, the two models collaborate through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). SNF is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. At each time step, SNF leverages the unique strengths of each model and automatically blends the predicted noises from both models spatially in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser.
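To make the fusion step concrete, the following is a minimal illustrative sketch of saliency-aware noise blending at one denoising time step. It assumes each model exposes a conditional and an unconditional noise prediction, derives a per-pixel saliency map from the magnitude of each model's classifier-free guidance response, and blends the two guided noises with softmax weights over the saliency maps. The function name, the softmax weighting (rather than a hard per-pixel mask), and the `temperature` parameter are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def saliency_adaptive_noise_fusion(eps_tdm_cond, eps_tdm_uncond,
                                   eps_sdm_cond, eps_sdm_uncond,
                                   guidance_scale=7.5, temperature=1.0):
    """Sketch of saliency-aware spatial blending of predicted noises
    from two diffusion models (arrays of shape [C, H, W]).
    Illustrative only; details differ from the paper's SNF."""
    # Classifier-free guidance response of each model
    resp_tdm = eps_tdm_cond - eps_tdm_uncond
    resp_sdm = eps_sdm_cond - eps_sdm_uncond
    # Saliency proxy: channel-averaged magnitude of the guidance response
    sal_tdm = np.abs(resp_tdm).mean(axis=0, keepdims=True)  # [1, H, W]
    sal_sdm = np.abs(resp_sdm).mean(axis=0, keepdims=True)
    # Softmax over the two saliency maps -> spatial blending weights
    logits = np.stack([sal_tdm, sal_sdm], axis=0) / temperature
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)                    # [2, 1, H, W]
    # Classifier-free-guided noise of each model
    eps_tdm = eps_tdm_uncond + guidance_scale * resp_tdm
    eps_sdm = eps_sdm_uncond + guidance_scale * resp_sdm
    # Saliency-weighted spatial blend (convex per pixel)
    return w[0] * eps_tdm + w[1] * eps_sdm
```

Because the weights are a per-pixel convex combination, regions where the subject model's guidance responds strongly are dominated by SDM's noise, while the rest of the scene follows TDM.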