Existing scene text detection methods typically rely on extensive real data for training. Because annotated real images are scarce, recent works have attempted to exploit large-scale labeled synthetic data (LSD) for pre-training text detectors. However, the resulting synth-to-real domain gap limits the performance of these detectors. In this work, we instead propose FreeReal, a real-domain-aligned pre-training paradigm that exploits the complementary strengths of both LSD and unlabeled real data (URD). Specifically, to bridge the real and synthetic worlds during pre-training, we tailor a glyph-based mixing mechanism (GlyphMix) to text images. GlyphMix delineates the character structures of synthetic images and embeds them as graffiti-like units onto real images. Without introducing real-domain drift, GlyphMix freely yields real-world images with annotations derived from synthetic labels. Furthermore, given free fine-grained synthetic labels, GlyphMix can effectively bridge the linguistic domain gap between English-dominated LSD and URD in various languages. Without bells and whistles, FreeReal improves the performance of FCENet, PSENet, PANet, and DBNet by an average of 1.97%, 3.90%, 3.85%, and 4.56%, respectively, consistently outperforming previous pre-training methods by a substantial margin across four public datasets. Code is available at https://github.com/SJTU-DeepVisionLab/FreeReal.
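To make the mixing idea concrete, the following is a minimal sketch of glyph-level pasting, not the released FreeReal implementation. It assumes that a binary glyph mask marking character strokes in the synthetic image is already available and that the synthetic and real images share the same size; the abstract does not say how the mask is obtained or how images are aligned. The names `glyph_mix`, `glyph_mask`, and `synth_polygons` are hypothetical, and plain Python lists are used only to keep the example dependency-free.

```python
from typing import List, Tuple

Pixel = Tuple[int, int, int]
Image = List[List[Pixel]]          # rows of RGB pixels
Mask = List[List[bool]]            # True where a character stroke lies
Polygon = List[Tuple[int, int]]    # text-region vertices as (x, y)


def glyph_mix(
    synth_img: Image,
    glyph_mask: Mask,
    real_img: Image,
    synth_polygons: List[Polygon],
) -> Tuple[Image, List[Polygon]]:
    """Paste synthetic glyph pixels onto a real image (illustrative only).

    Assumption: all three inputs have identical height and width, and
    glyph_mask is True exactly on the character strokes of synth_img.
    """
    if not real_img or not real_img[0]:
        raise ValueError("real_img must be non-empty")
    height, width = len(real_img), len(real_img[0])
    for name, grid in (("synth_img", synth_img), ("glyph_mask", glyph_mask)):
        if len(grid) != height or any(len(row) != width for row in grid):
            raise ValueError(f"{name} must match real_img in size")

    # Stroke pixels come from the synthetic image; every other pixel
    # keeps the real image's content, so the background stays real.
    mixed = [
        [synth_img[y][x] if glyph_mask[y][x] else real_img[y][x]
         for x in range(width)]
        for y in range(height)
    ]

    # Pixel coordinates are unchanged, so the synthetic text polygons
    # can be reused as labels for the mixed image.
    labels = [list(polygon) for polygon in synth_polygons]
    return mixed, labels


if __name__ == "__main__":
    real = [[(10, 20, 30)] * 3 for _ in range(2)]
    synth = [[(255, 255, 255)] * 3 for _ in range(2)]
    mask = [[False, True, False], [False, False, False]]
    boxes = [[(1, 0), (2, 0), (2, 1), (1, 1)]]
    mixed_img, mixed_labels = glyph_mix(synth, mask, real, boxes)
    print(mixed_img[0], mixed_labels)
```

Because only stroke pixels are replaced, the surrounding context remains real imagery while the annotations are inherited from the synthetic labels, which is the property the abstract attributes to GlyphMix.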