Improving Diffusion Models for Scene Text Editing with Dual Encoders

Scene text editing is a challenging task that involves modifying or inserting specified text in an image while maintaining a natural and realistic appearance. Most previous approaches rely on style-transfer models that crop out text regions and feed them into image transfer models, such as GANs. However, these methods are limited in their ability to change text style and cannot insert text into images. Recent advances in diffusion models show promise in overcoming these limitations through text-conditional image editing. However, our empirical analysis reveals that state-of-the-art diffusion models struggle to render text correctly and to control text style. To address these problems, we propose DIFFSTE, which improves pre-trained diffusion models with a dual-encoder design comprising a character encoder for better text legibility and an instruction encoder for better style control. We introduce an instruction tuning framework that trains our model to learn the mapping from a text instruction to the corresponding image, with either the specified style or the style of the surrounding text in the background. This training method further gives our model zero-shot generalization ability in three scenarios: generating text with unseen font variations (e.g., italic and bold), mixing different fonts to construct a new font, and using more relaxed forms of natural language as instructions to guide generation. We evaluate our approach on five datasets and demonstrate its superior performance in terms of text correctness, image naturalness, and style controllability. Our code is publicly available at https://github.com/UCSB-NLP-Chang/DiffSTE.
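
As an illustration only, the two kinds of input that a dual-encoder design of this sort would consume can be sketched in Python: a natural-language instruction that either names a style or leaves it unspecified (so the style of the surrounding text is followed), and a character-level view of the target text. The instruction template, the function names `build_instruction` and `char_tokens`, and the use of Unicode code points as character ids are all assumptions made for this sketch; they are not taken from the DIFFSTE implementation.

```python
from typing import List, Optional


def build_instruction(text: str,
                      font: Optional[str] = None,
                      color: Optional[str] = None) -> str:
    """Build a text instruction for the (hypothetical) instruction encoder.

    The template wording is an assumption. When neither font nor color is
    given, the instruction carries no style, which corresponds to the case
    where the style of the surrounding text should be followed.
    """
    parts = [f'Write "{text}"']
    if font is not None:
        parts.append(f"in font {font}")
    if color is not None:
        parts.append(f"in color {color}")
    return " ".join(parts)


def char_tokens(text: str) -> List[int]:
    """Split the target text into per-character ids.

    Unicode code points stand in for a character vocabulary here; a real
    character encoder would map characters to learned embeddings.
    """
    return [ord(c) for c in text]


if __name__ == "__main__":
    # Style specified explicitly in the instruction.
    print(build_instruction("Hello", font="Arial", color="red"))
    # No style given: follow the surrounding text.
    print(build_instruction("Hello"))
    # Character-level view of the same target text.
    print(char_tokens("Hello"))
```

Keeping the two inputs separate mirrors the division of labor described above: the character-level sequence targets spelling and legibility, while the instruction string carries the style request.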
