We present Face0, a novel way to instantaneously condition a text-to-image generation model on a face, at sample time, without any optimization procedures such as fine-tuning or inversion. We augment a dataset of annotated images with embeddings of the faces they contain and train an image generation model on the augmented dataset. Once trained, our system is practically identical at inference time to the underlying base model, and is therefore able to generate images, given a user-supplied face image and a prompt, in just a couple of seconds. Our method achieves pleasing results, is remarkably simple and extremely fast, and equips the underlying model with new capabilities, such as controlling the generated images both via text and via direct manipulation of the input face embeddings. In addition, when a fixed random vector is used in place of a face embedding from a user-supplied image, our method essentially solves the problem of consistent character generation across images. Finally, while further research is required, we hope that our method, which decouples the model's textual biases from its biases on faces, might be a step towards mitigating biases in future text-to-image models.
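The two embedding-side controls mentioned above, reusing one fixed random vector across prompts and directly manipulating the face embedding, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the embedding width `EMBED_DIM`, the unit-length normalization, linear interpolation as the manipulation, and the `generate` function, which is a hypothetical stand-in that only packages the conditioning inputs and is not the Face0 model or its interface.

```python
import math
import random

EMBED_DIM = 512  # assumed embedding width; not stated in the abstract


def _normalize(vec):
    """Scale a vector to unit length (an assumed convention)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def fixed_character_embedding(seed, dim=EMBED_DIM):
    """Draw one random unit vector; the same seed always yields the same vector."""
    rng = random.Random(seed)
    return _normalize([rng.gauss(0.0, 1.0) for _ in range(dim)])


def interpolate(a, b, t):
    """Blend two embeddings linearly (t=0 gives a, t=1 gives b) and re-normalize."""
    return _normalize([(1.0 - t) * x + t * y for x, y in zip(a, b)])


def generate(prompt, face_embedding):
    """Hypothetical stand-in for the trained model: returns its conditioning inputs."""
    return {"prompt": prompt, "face": face_embedding}


# One fixed vector plays the role of a recurring character across prompts.
character = fixed_character_embedding(seed=7)
prompts = ["a person reading in a cafe", "a person hiking at sunrise"]
requests = [generate(p, character) for p in prompts]

# Direct manipulation: move the character halfway towards another embedding.
other = fixed_character_embedding(seed=8)
blended = interpolate(character, other, 0.5)
```

In this sketch, consistency comes only from passing the identical vector with every prompt; whether a given random vector yields a plausible face depends on the trained model, which the sketch does not cover.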