Personalizing text-to-image models from a small set of images of a specific object has been studied as subject-specific image generation. However, existing methods often struggle to follow text prompts because they overfit to the limited training images. In this work, we introduce InstructBooth, a novel method designed to improve image-text alignment in personalized text-to-image models without sacrificing personalization ability. Our approach first personalizes a text-to-image model on a small number of subject-specific images using a unique identifier. After personalization, we fine-tune the personalized model with reinforcement learning to maximize a reward that quantifies image-text alignment. Additionally, we propose complementary techniques to increase the synergy between these two processes. Our method achieves better image-text alignment than existing baselines while maintaining high personalization ability. In human evaluations, InstructBooth outperforms these baselines when all evaluated factors are considered together. Our project page is at https://sites.google.com/view/instructbooth.
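As a reading aid, the reinforcement-learning stage described above can be written as a reward-maximization objective. The following is a minimal sketch under stated assumptions, not the paper's exact formulation: the symbols $\theta$ (parameters of the personalized model), $c$ (a text prompt containing the unique identifier), $x_0$ (the generated image), $x_T, \dots, x_1$ (intermediate denoising states), and $r$ (the image-text alignment reward) are notation introduced here, and the score-function gradient shown is one common estimator for diffusion models. The abstract does not specify which estimator or reward model is used.

```latex
\begin{aligned}
J(\theta) &= \mathbb{E}_{c \sim p(c),\; x_0 \sim p_\theta(x_0 \mid c)}
  \big[\, r(x_0, c) \,\big], \\
\nabla_\theta J(\theta) &= \mathbb{E}_{c \sim p(c),\; x_{T:0} \sim p_\theta(\cdot \mid c)}
  \Big[\, r(x_0, c) \sum_{t=1}^{T} \nabla_\theta \log p_\theta(x_{t-1} \mid x_t, c) \,\Big].
\end{aligned}
```

Under this reading, the personalization stage supplies the initial $\theta$, and the fine-tuning stage performs gradient ascent on $J(\theta)$, so that images sampled for a prompt $c$ receive higher alignment reward.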