Contour-based scene text detection methods have developed rapidly in recent years, but they still suffer from inaccurate front-end contour initialization, multi-stage error accumulation, or deficient local information aggregation. To tackle these limitations, we propose CT-Net, a novel arbitrary-shaped scene text detection framework that performs progressive contour regression with contour transformers. Specifically, we first employ a contour initialization module that generates coarse text contours without any post-processing. Then, we adopt contour refinement modules that adaptively refine the text contours in an iterative manner, which benefits context information capture and progressive global contour deformation. In addition, we propose an adaptive training strategy that enables the contour transformers to learn more potential deformation paths, and we introduce a re-score mechanism that effectively suppresses false positives. Extensive experiments on four challenging datasets demonstrate the accuracy and efficiency of CT-Net compared with state-of-the-art methods. In particular, CT-Net achieves an F-measure of 86.1 at 11.2 frames per second (FPS) on CTW1500 and an F-measure of 87.8 at 10.1 FPS on Total-Text.
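The abstract gives no implementation details, so the following is only a minimal sketch of the control flow it describes: a coarse contour is produced first, refined over several iterations by adding predicted per-point offsets, and low-scoring detections are then discarded. All names here (`init_contour`, `refine`, `rescore_filter`, `toward_target`) are hypothetical. The offset predictor is a toy stand-in that moves each point halfway toward a known target, not the learned contour transformer, and the circular initialization and score threshold are likewise assumptions made for illustration.

```python
import math
from typing import Callable, List, Tuple

Point = Tuple[float, float]
Contour = List[Point]


def init_contour(cx: float, cy: float, radius: float, n_points: int = 8) -> Contour:
    """Coarse initial contour: n_points evenly spaced on a circle (assumed shape)."""
    return [
        (cx + radius * math.cos(2 * math.pi * k / n_points),
         cy + radius * math.sin(2 * math.pi * k / n_points))
        for k in range(n_points)
    ]


def refine(contour: Contour,
           predict_offsets: Callable[[Contour], List[Point]],
           n_iters: int = 3) -> Contour:
    """Progressive regression: each iteration adds predicted offsets to every point."""
    for _ in range(n_iters):
        offsets = predict_offsets(contour)
        contour = [(x + dx, y + dy) for (x, y), (dx, dy) in zip(contour, offsets)]
    return contour


def rescore_filter(contours: List[Contour], scores: List[float],
                   threshold: float = 0.5) -> List[Contour]:
    """Keep only contours whose score reaches the threshold (assumed value)."""
    return [c for c, s in zip(contours, scores) if s >= threshold]


def toward_target(contour: Contour, target: Contour, step: float = 0.5) -> List[Point]:
    """Toy offset predictor: move each point a fraction `step` toward its target point."""
    return [(step * (tx - x), step * (ty - y))
            for (x, y), (tx, ty) in zip(contour, target)]


if __name__ == "__main__":
    coarse = init_contour(0.0, 0.0, 1.0)   # coarse contour, radius 1
    target = init_contour(0.0, 0.0, 2.0)   # stand-in for the true text boundary
    refined = refine(coarse, lambda c: toward_target(c, target), n_iters=3)
    # The remaining error halves each iteration: 1 -> 0.5 -> 0.25 -> 0.125
    print(round(math.hypot(*refined[0]), 3))  # prints 1.875
    kept = rescore_filter([coarse, refined], [0.2, 0.9])
    print(len(kept))  # prints 1
```

In this toy setting the contour approaches the target geometrically, which illustrates why several small deformation steps can correct a coarse initialization; the convergence behaviour of the actual model depends on its learned predictor.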