Scene text segmentation aims to crop text from scene images and is commonly used to help generative models edit or remove text. Existing text segmentation methods tend to involve various text-related supervisions for better performance. However, most of them ignore the importance of text edges, which are significant for downstream applications. In this paper, we propose Edge-Aware Transformers, termed EAFormer, to segment texts more accurately, especially at the edges of texts. Specifically, we first design a text edge extractor to detect edges and filter out those of non-text areas. Then, we propose an edge-guided encoder to make the model focus more on text edges. Finally, an MLP-based decoder is employed to predict text masks. We have conducted extensive experiments on commonly used benchmarks to verify the effectiveness of EAFormer. The experimental results demonstrate that the proposed method outperforms previous methods, especially on the segmentation of text edges. Considering that the annotations of several benchmarks (e.g., COCO_TS and MLT_S) are not accurate enough to fairly evaluate our method, we have relabeled these datasets. Through experiments, we observe that our method achieves a larger performance improvement when more accurate annotations are used for training.
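The first stage described above, detecting edges and discarding those outside text areas, can be sketched as follows. This is a minimal illustration only: it uses a plain Sobel operator and a given text-region mask as stand-ins for the paper's learned text edge extractor, so all function names and the threshold are hypothetical.

```python
import numpy as np

def sobel_edges(img: np.ndarray) -> np.ndarray:
    """Gradient-magnitude edge map via 3x3 Sobel filters (edge padding)."""
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    h, w = img.shape
    out = np.zeros((h, w))
    padded = np.pad(img.astype(float), 1, mode="edge")
    for i in range(h):
        for j in range(w):
            patch = padded[i:i + 3, j:j + 3]
            gx = np.sum(patch * kx)
            gy = np.sum(patch * ky)
            out[i, j] = np.hypot(gx, gy)
    return out

def text_edge_map(img: np.ndarray, text_region_mask: np.ndarray,
                  thresh: float = 1.0) -> np.ndarray:
    """Keep only edges inside the (assumed-given) text-region mask,
    mirroring the 'filter out edges of non-text areas' step."""
    edges = sobel_edges(img) > thresh
    return edges & text_region_mask.astype(bool)
```

In EAFormer the filtering is learned rather than mask-based, and the resulting text-edge map is fed to the edge-guided encoder to bias attention toward edge regions; the sketch only shows the input/output shape of that step.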