ViTEraser: Harnessing the Power of Vision Transformers for Scene Text Removal with SegMIM Pretraining

Scene text removal (STR) aims to replace text strokes in natural scenes with visually coherent backgrounds. Recent STR approaches rely on iterative refinement or explicit text masks, which increases complexity and makes them sensitive to the accuracy of text localization. Moreover, most existing STR methods use convolutional neural networks (CNNs) for feature representation, while the potential of vision Transformers (ViTs) remains largely unexplored. In this paper, we propose a simple yet effective ViT-based text eraser, dubbed ViTEraser. It follows a concise encoder-decoder framework into which different types of ViTs can be easily integrated to strengthen the modeling of long-range dependencies and global reasoning. Specifically, the encoder hierarchically maps the input image into the hidden space through ViT blocks and patch embedding layers, while the decoder gradually upsamples the hidden features to the text-erased image with ViT blocks and patch splitting layers. Because ViTEraser implicitly integrates text localization and inpainting, we propose a novel end-to-end pretraining method, termed SegMIM, which focuses the encoder on text box segmentation and the decoder on masked image modeling. To verify the effectiveness of the proposed methods, we comprehensively explore the architecture, pretraining, and scalability of the ViT-based encoder-decoder for STR, providing insights into the application of ViTs to STR. Experimental results demonstrate that ViTEraser with SegMIM achieves state-of-the-art performance on STR by a substantial margin. Furthermore, an extended experiment on tampered scene text detection demonstrates that ViTEraser generalizes to other tasks. We believe this paper can inspire more research on ViT-based STR approaches. Code will be available at https://github.com/shannanyinxiang/ViTEraser.
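
To make the roles of the patch embedding and patch splitting layers concrete, the following is a minimal NumPy sketch of the spatial rearrangement only. It is not the released ViTEraser code: the function names, the stage factors in `ENC_PATCH` and `DEC_PATCH`, and the 64x64 input are illustrative assumptions, and the learned projections and ViT blocks that surround these layers in a real model are omitted.

```python
import numpy as np

def patch_embed(x, p):
    """Downsample by merging each p x p patch into the channel axis.

    (H, W, C) -> (H/p, W/p, p*p*C). A real layer would normally follow
    this with a learned linear projection, which is omitted here.
    """
    h, w, c = x.shape
    assert h % p == 0 and w % p == 0, "spatial size must be divisible by p"
    x = x.reshape(h // p, p, w // p, p, c)
    x = x.transpose(0, 2, 1, 3, 4)          # (H/p, W/p, p, p, C)
    return x.reshape(h // p, w // p, p * p * c)

def patch_split(x, p):
    """Upsample by spreading channels back over a p x p patch.

    (H, W, p*p*C) -> (H*p, W*p, C). Exact inverse of patch_embed for
    the same p.
    """
    h, w, c = x.shape
    assert c % (p * p) == 0, "channels must be divisible by p*p"
    x = x.reshape(h, w, p, p, c // (p * p))
    x = x.transpose(0, 2, 1, 3, 4)          # (H, p, W, p, C')
    return x.reshape(h * p, w * p, c // (p * p))

# Hypothetical stage factors: the encoder downsamples by 32 in total,
# and the decoder upsamples by 2 per stage back to full resolution.
ENC_PATCH = (4, 2, 2, 2)
DEC_PATCH = (2, 2, 2, 2, 2)

def shape_trace(image):
    """Return the feature-map shape after every encoder and decoder stage."""
    shapes = [image.shape]
    x = image
    for p in ENC_PATCH:
        x = patch_embed(x, p)   # a ViT block would process x here
        shapes.append(x.shape)
    for p in DEC_PATCH:
        x = patch_split(x, p)   # a ViT block would process x here
        shapes.append(x.shape)
    return shapes

if __name__ == "__main__":
    image = np.arange(64 * 64 * 3, dtype=np.float32).reshape(64, 64, 3)
    for s in shape_trace(image):
        print(s)
```

Under these assumptions, the spatial size shrinks from 64x64 to 2x2 through the encoder and returns to 64x64 through the decoder. Because no projection is applied, the channel counts here simply follow from the rearrangement and do not reflect the hidden dimensions of any actual ViTEraser variant.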
