Recent latent diffusion models such as SDXL and SD 1.5 have shown a remarkable ability to generate highly detailed and realistic images. Despite this capability, generating accurate text within images remains a challenging task. In this paper, we examine the effectiveness of fine-tuning approaches for generating legible text within images. We propose a low-cost approach that leverages SDXL without time-consuming training on large-scale datasets. The proposed strategy employs a fine-tuning procedure that examines the effects of data-refinement levels and synthetic captions. Our results demonstrate that this small-scale fine-tuning approach improves the accuracy of text generation across different scenarios without the need for additional multimodal encoders. In particular, our experiments show that augmenting our raw dataset with random letters improves the model's ability to produce well-formed visual text.