Towards Robust Real-Time Scene Text Detection: From Semantic to Instance Representation Learning

Owing to their flexible representation of arbitrary-shaped scene text and their simple pipeline, bottom-up segmentation-based methods have become mainstream in real-time scene text detection. Despite great progress, these methods still lack robustness and suffer from false positives and instance adhesion. Unlike existing methods, which integrate multiple-granularity features or multiple outputs, we take a representation learning perspective: auxiliary tasks are used during optimization so that the encoder learns robust features jointly with the main task of per-pixel classification. For semantic representation learning, we propose global-dense semantic contrast (GDSC), in which a vector is extracted as the global semantic representation and then contrasted element-wise with the dense grid features. To learn instance-aware representations, we propose combining top-down modeling (TDM) with the bottom-up framework to provide implicit instance-level clues for the encoder. With the proposed GDSC and TDM, the encoder network learns stronger representations without introducing any additional parameters or computation during inference. Equipped with a very light decoder, the detector achieves more robust real-time scene text detection. Experimental results on four public datasets show that the proposed method outperforms or is comparable to the state of the art in both accuracy and speed. Specifically, the proposed method achieves 87.2% F-measure at 48.2 FPS on Total-Text and 89.6% F-measure at 36.9 FPS on MSRA-TD500 on a single GeForce RTX 2080 Ti GPU.
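
To make the idea of contrasting a global vector with dense grid features more concrete, the following is a minimal NumPy sketch and not the implementation used in this work. Everything specific in it is an assumption made for illustration: the function name `global_dense_contrast`, the use of masked average pooling over ground-truth text pixels to obtain the global vector, cosine similarity as the contrast measure, the `temperature` value, and the binary cross-entropy form of the loss.

```python
import numpy as np


def global_dense_contrast(features, text_mask, temperature=0.1, eps=1e-8):
    """Illustrative global-to-dense contrast (hypothetical formulation).

    features:  (C, H, W) float array of encoder grid features.
    text_mask: (H, W) array, 1 for text pixels and 0 for background.
    Returns the (H, W) cosine-similarity map and a scalar loss.
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w)                      # (C, N)
    mask = text_mask.reshape(h * w).astype(features.dtype) # (N,)

    # Assumption: the global semantic vector is the mean feature over text pixels.
    global_vec = (flat * mask).sum(axis=1) / (mask.sum() + eps)  # (C,)

    # Cosine similarity between the global vector and every grid feature.
    flat_unit = flat / (np.linalg.norm(flat, axis=0, keepdims=True) + eps)
    global_unit = global_vec / (np.linalg.norm(global_vec) + eps)
    sim = global_unit @ flat_unit                          # (N,)

    # Assumption: text pixels should agree with the global vector and
    # background pixels should not; scored with a numerically stable
    # binary cross-entropy on the scaled similarities.
    logits = sim / temperature
    bce = np.maximum(logits, 0) - logits * mask + np.log1p(np.exp(-np.abs(logits)))
    return sim.reshape(h, w), float(bce.mean())


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    feats = rng.normal(size=(8, 16, 16))
    mask = np.zeros((16, 16))
    mask[4:12, 4:12] = 1
    sim_map, loss = global_dense_contrast(feats, mask)
    print(sim_map.shape, loss)
```

In this sketch the contrast term is computed only from encoder features and the training mask, so it could be dropped at inference time, which is consistent with the statement above that no parameters or computation are added during inference.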
