Diffusion-based image generation methods have recently drawn praise for their remarkable text-to-image capabilities, yet they still struggle to generate multilingual scene text images accurately. To tackle this problem, we propose Diff-Text, a training-free scene text generation framework for any language. Given text in any language and a textual description of a scene, our model outputs a photo-realistic image. The model uses rendered sketch images as priors, thereby eliciting the latent multilingual generation ability of the pre-trained Stable Diffusion. Based on the observation that the cross-attention map influences object placement in generated images, we introduce a localized attention constraint into the cross-attention layer to address the unreasonable positioning of scene text. Additionally, we introduce contrastive image-level prompts to further refine the position of the textual region and achieve more accurate scene text generation. Experiments demonstrate that our method outperforms existing methods in both text recognition accuracy and the naturalness of foreground-background blending.
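The abstract does not give the exact form of the localized attention constraint, so the following is only a minimal sketch of one plausible reading: selected prompt tokens are prevented from influencing spatial locations outside a target region by masking their cross-attention logits before the softmax. The function name `localized_cross_attention`, its parameters, and the toy shapes are all hypothetical and are not taken from the Diff-Text implementation.

```python
import numpy as np

def localized_cross_attention(q, k, v, region_mask, token_ids):
    """Cross-attention in which chosen prompt tokens act only inside a region.

    Hypothetical illustration, not the authors' formulation.
    q: (N, d) queries, one per spatial location (N = H * W)
    k: (T, d) keys, one per prompt token
    v: (T, d_v) values, one per prompt token
    region_mask: (N,) boolean, True where the scene text should appear
    token_ids: indices of the prompt tokens to confine to the region
    """
    token_ids = list(token_ids)
    # At least one token must stay unconstrained so every row has a finite logit.
    assert len(set(token_ids)) < k.shape[0]

    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                      # (N, T)

    # Block the constrained tokens at every location outside the region.
    blocked = np.zeros(logits.shape, dtype=bool)
    blocked[:, token_ids] = (~region_mask)[:, None]
    logits = np.where(blocked, -np.inf, logits)

    # Numerically stable softmax over the token axis.
    logits = logits - logits.max(axis=-1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy usage: an 8x8 latent grid, 5 prompt tokens, token 3 confined to a box.
rng = np.random.default_rng(0)
H, W, T, d = 8, 8, 5, 16
q = rng.normal(size=(H * W, d))
k = rng.normal(size=(T, d))
v = rng.normal(size=(T, d))
mask = np.zeros((H, W), dtype=bool)
mask[2:5, 1:7] = True                                  # assumed text region
out, weights = localized_cross_attention(q, k, v, mask.ravel(), [3])
```

In this sketch, locations outside the box assign zero attention weight to token 3, while the remaining tokens are renormalized so each row still sums to one. A real pipeline would apply such a mask inside the cross-attention layers of the pre-trained denoiser rather than on random arrays.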