Current subject-driven image generation methods encounter significant challenges in person-centric image generation because they learn semantic scene and person generation by fine-tuning a common pre-trained diffusion model, which involves an irreconcilable training imbalance. Specifically, to generate realistic persons, they need to tune the pre-trained model sufficiently, which inevitably causes the model to forget the rich semantic scene prior and makes scene generation overfit the training data. Moreover, even with sufficient fine-tuning, these methods still cannot generate high-fidelity persons, since joint learning of scene and person generation also leads to a quality compromise. In this paper, we propose Face-diffuser, an effective collaborative generation pipeline that eliminates the above training imbalance and quality compromise. Specifically, we first develop two specialized pre-trained diffusion models, i.e., a Text-driven Diffusion Model (TDM) and a Subject-augmented Diffusion Model (SDM), for scene and person generation, respectively. The sampling process is divided into three sequential stages: semantic scene construction, subject-scene fusion, and subject enhancement. The first and last stages are performed by TDM and SDM, respectively. In the subject-scene fusion stage, the two models collaborate through a novel and highly effective mechanism, Saliency-adaptive Noise Fusion (SNF). SNF is based on our key observation that there exists a robust link between classifier-free guidance responses and the saliency of generated images. At each time step, SNF leverages the unique strengths of each model and automatically blends the predicted noises from both models spatially in a saliency-aware manner. Extensive experiments confirm the impressive effectiveness and robustness of Face-diffuser.
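To make the fusion step concrete, the following is a minimal illustrative sketch of saliency-aware noise blending at one denoising time step. It assumes each model exposes a conditional and an unconditional noise prediction, derives a per-pixel saliency map from the magnitude of each model's classifier-free guidance response, and blends the two guided noises with softmax weights over the saliency maps. The function name, the softmax weighting (rather than a hard per-pixel mask), and the `temperature` parameter are assumptions for illustration, not the paper's exact implementation.

```python
import numpy as np

def saliency_adaptive_noise_fusion(eps_tdm_cond, eps_tdm_uncond,
                                   eps_sdm_cond, eps_sdm_uncond,
                                   guidance_scale=7.5, temperature=1.0):
    """Sketch of saliency-aware spatial blending of predicted noises
    from two diffusion models (arrays of shape [C, H, W]).
    Illustrative only; details differ from the paper's SNF."""
    # Classifier-free guidance response of each model
    resp_tdm = eps_tdm_cond - eps_tdm_uncond
    resp_sdm = eps_sdm_cond - eps_sdm_uncond
    # Saliency proxy: channel-averaged magnitude of the guidance response
    sal_tdm = np.abs(resp_tdm).mean(axis=0, keepdims=True)  # [1, H, W]
    sal_sdm = np.abs(resp_sdm).mean(axis=0, keepdims=True)
    # Softmax over the two saliency maps -> spatial blending weights
    logits = np.stack([sal_tdm, sal_sdm], axis=0) / temperature
    w = np.exp(logits - logits.max(axis=0, keepdims=True))
    w = w / w.sum(axis=0, keepdims=True)                    # [2, 1, H, W]
    # Classifier-free-guided noise of each model
    eps_tdm = eps_tdm_uncond + guidance_scale * resp_tdm
    eps_sdm = eps_sdm_uncond + guidance_scale * resp_sdm
    # Saliency-weighted spatial blend (convex per pixel)
    return w[0] * eps_tdm + w[1] * eps_sdm
```

Because the weights are a per-pixel convex combination, regions where the subject model's guidance responds strongly are dominated by SDM's noise, while the rest of the scene follows TDM.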