Scene Text Editing (STE) is a challenging research problem that aims to modify existing text in an image while preserving the background and the font style of the original text. Despite its utility in numerous real-world applications, existing style-transfer-based approaches show sub-par editing performance because of (1) complex image backgrounds, (2) diverse font attributes, and (3) varying word lengths within the text. To address these limitations, we propose a novel font-agnostic scene text editing and rendering framework, named FASTER, that generates text in arbitrary styles and at arbitrary locations while preserving a natural and realistic appearance and structure. FASTER fuses target mask generation and style transfer units with a cascaded self-attention mechanism that focuses on multi-level text region edits in order to handle varying word lengths. Extensive evaluation on a real-world database, together with a subjective human evaluation study, indicates the superiority of FASTER on both scene text editing and rendering tasks in terms of model performance and efficiency. Our code will be released upon acceptance.