Recent work on personalized text-to-image generation usually learns to bind a special token to the specific subjects or styles of a few given images by tuning the token's embedding through gradient descent. It is natural to ask whether textual inversions can be optimized with access only to the model's inference process. Requiring only forward computation to determine the textual inversion retains the benefits of lower GPU memory use, simple deployment, and secure access for scalable models. In this paper, we introduce a \emph{gradient-free} framework that optimizes the continuous textual inversion with an iterative evolutionary strategy. Specifically, we first initialize an appropriate token embedding for textual inversion, taking both visual and text-vocabulary information into account. Then, we decompose the evolutionary-strategy optimization into dimension reduction of the search space and non-convex gradient-free optimization in the resulting subspace, which significantly accelerates optimization with negligible performance loss. Experiments in several applications demonstrate that a text-to-image model equipped with our gradient-free method performs comparably to its gradient-based counterparts, while running on a variety of GPU/CPU platforms and offering flexible deployment and computational efficiency.
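The two-stage decomposition described above (dimension reduction of the search space, then gradient-free optimization in the subspace) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration rather than the method of the paper: the function name `subspace_es`, the fixed random projection used for dimension reduction, the simple elite-averaging evolution strategy, and all hyperparameter values are hypothetical. The `loss_fn` argument stands in for any objective that can be evaluated with forward passes only.

```python
import numpy as np


def subspace_es(loss_fn, init_embedding, subspace_dim=8, population=20,
                elite=5, sigma=0.5, iterations=100, seed=0):
    """Minimize loss_fn around init_embedding using forward evaluations only.

    Hypothetical sketch: the search runs over a low-dimensional vector z,
    and each candidate embedding is init_embedding + projection @ z.
    """
    rng = np.random.default_rng(seed)
    full_dim = init_embedding.shape[0]

    # Dimension reduction (assumed here to be a fixed random projection).
    projection = rng.normal(size=(full_dim, subspace_dim)) / np.sqrt(subspace_dim)

    mean = np.zeros(subspace_dim)
    best_z = mean.copy()
    best_loss = loss_fn(init_embedding)

    for _ in range(iterations):
        # Sample a population of candidates around the current mean.
        candidates = mean + sigma * rng.normal(size=(population, subspace_dim))
        losses = np.array(
            [loss_fn(init_embedding + projection @ z) for z in candidates]
        )

        # Move the mean to the average of the lowest-loss candidates.
        order = np.argsort(losses)
        mean = candidates[order[:elite]].mean(axis=0)

        # Track the best candidate seen so far.
        if losses[order[0]] < best_loss:
            best_loss = losses[order[0]]
            best_z = candidates[order[0]].copy()

        # Slowly shrink the sampling radius.
        sigma *= 0.97

    return init_embedding + projection @ best_z, best_loss


# Toy usage: a quadratic loss stands in for the forward-only objective.
toy_rng = np.random.default_rng(1)
target = toy_rng.normal(size=64)
init = np.zeros(64)


def toy_loss(embedding):
    return float(np.sum((embedding - target) ** 2))


embedding, final_loss = subspace_es(toy_loss, init)
```

Because the search is confined to an 8-dimensional subspace of the 64-dimensional toy space, the loss generally cannot reach zero in this sketch; it trades some attainable accuracy for far fewer search dimensions, which loosely mirrors the trade-off the abstract describes.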