Arbitrary Reading Order Scene Text Spotter with Local Semantics Guidance

Scene text spotting has attracted growing research interest in recent years. Most existing scene text spotters follow the detection-then-recognition paradigm, in which a vanilla detection module can hardly determine the reading order, leading to recognition failures. Revisiting auto-regressive scene text recognition, we find that a well-trained recognizer can implicitly perceive the local semantics of all characters in a complete word or sentence without a character-level detection module. This local semantic knowledge includes not only the text content but also spatial information in the correct reading order. Motivated by this observation, we propose the Local Semantics Guided scene text Spotter (LSGSpotter), which auto-regressively decodes the position and content of characters under the guidance of local semantics. LSGSpotter introduces two modules. First, a Start Point Localization Module (SPLM) locates text start points to determine the correct reading order. Second, a Multi-scale Adaptive Attention Module (MAAM) adaptively aggregates text features in a local area. As a result, LSGSpotter performs arbitrary-reading-order text spotting without relying on sophisticated detection, while its grid sampling strategy reduces computational cost. Extensive experiments show that LSGSpotter achieves state-of-the-art performance on the InverseText benchmark. Moreover, our spotter demonstrates superior performance on English benchmarks for arbitrary-shaped text, achieving improvements of 0.7\% and 2.5\% on Total-Text and SCUT-CTW1500, respectively. These results validate that our text spotter is effective for scene text in arbitrary reading order and shape.
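
To make the idea of aggregating features over a local area at several scales more concrete, the following is a minimal illustrative sketch, not the LSGSpotter implementation. Everything in it is an assumption made for illustration: the function names (bilinear, local_grid, aggregate), the scalar feature map, the window radii, and the softmax weighting, which merely stands in for learned attention weights.

```python
import math


def bilinear(feat, x, y):
    """Bilinearly sample a 2D feature map (list of rows) at a real-valued
    point (x, y). Coordinates are clamped to the map borders."""
    h, w = len(feat), len(feat[0])
    x = min(max(x, 0.0), w - 1.0)
    y = min(max(y, 0.0), h - 1.0)
    x0, y0 = int(math.floor(x)), int(math.floor(y))
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    dx, dy = x - x0, y - y0
    top = feat[y0][x0] * (1 - dx) + feat[y0][x1] * dx
    bottom = feat[y1][x0] * (1 - dx) + feat[y1][x1] * dx
    return top * (1 - dy) + bottom * dy


def local_grid(cx, cy, radius, k=3):
    """Return k x k sampling points spread evenly over a square window
    of the given radius centred at (cx, cy). Requires k >= 2."""
    offsets = [radius * (2 * i / (k - 1) - 1) for i in range(k)]
    return [(cx + ox, cy + oy) for oy in offsets for ox in offsets]


def aggregate(feat, cx, cy, radii=(1.0, 2.0, 4.0), k=3):
    """Sample the feature map on local grids of several sizes and combine
    the samples with softmax weights (hypothetical stand-in for the
    attention weights a trained module would predict)."""
    samples = [
        bilinear(feat, x, y)
        for r in radii
        for (x, y) in local_grid(cx, cy, r, k)
    ]
    peak = max(samples)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in samples]
    total = sum(weights)
    return sum(w * s for w, s in zip(weights, samples)) / total


if __name__ == "__main__":
    # Toy 8 x 8 feature map with a single strong response at (x=5, y=3).
    feature_map = [[0.0] * 8 for _ in range(8)]
    feature_map[3][5] = 1.0
    print(aggregate(feature_map, cx=4.0, cy=3.0))
```

In this sketch only a fixed number of points (k x k per scale) is read around the current position, rather than the whole feature map, which is the general reason sparse local sampling can lower computational cost. How LSGSpotter actually chooses sampling points and weights is not specified in the abstract.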
