Connected components (CCs) are a natural text shape representation that aligns with human reading intuition. However, CC-based text detection methods have recently hit a bottleneck: their time-consuming post-processing is difficult to eliminate. To address this issue, we introduce an explicit relational reasoning network (ERRNet) that models the relationships between components without post-processing. Concretely, we first represent each text instance as multiple ordered text components and then treat these components as objects in sequential movement. In this way, scene text detection can be viewed as a tracking problem. From this perspective, we design an end-to-end tracking decoder, yielding a CC-based method that dispenses with post-processing entirely. Additionally, we observe an inconsistency between classification confidence and localization quality, so we propose a Polygon Monte-Carlo method to evaluate localization quality quickly and accurately. Based on this, we introduce a position-supervised classification loss to guide the task-aligned learning of ERRNet. Experiments on challenging benchmarks demonstrate the effectiveness of ERRNet: it consistently achieves state-of-the-art accuracy while maintaining highly competitive inference speed.
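To make the idea of Monte-Carlo localization scoring concrete, the following is a minimal sketch of how the overlap (IoU) between a predicted polygon and a ground-truth polygon could be estimated by random sampling. The abstract does not spell out the Polygon Monte-Carlo method, so everything here is an assumption for illustration: the function names `point_in_polygon` and `monte_carlo_polygon_iou`, the uniform sampling over the joint bounding box, the ray-casting inside test, and the default sample count are hypothetical and may differ from what ERRNet actually does.

```python
import random


def point_in_polygon(x, y, polygon):
    """Even-odd (ray-casting) test; polygon is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            # x-coordinate at which this edge crosses the horizontal line.
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def monte_carlo_polygon_iou(pred, gt, num_samples=20000, seed=0):
    """Estimate the IoU of two polygons by uniform sampling (illustrative only).

    Points are drawn uniformly from the joint bounding box of both polygons.
    The IoU estimate is the number of points inside both polygons divided by
    the number of points inside at least one of them.
    """
    rng = random.Random(seed)
    points = list(pred) + list(gt)
    x_min = min(p[0] for p in points)
    x_max = max(p[0] for p in points)
    y_min = min(p[1] for p in points)
    y_max = max(p[1] for p in points)

    inter = 0
    union = 0
    for _ in range(num_samples):
        x = rng.uniform(x_min, x_max)
        y = rng.uniform(y_min, y_max)
        in_pred = point_in_polygon(x, y, pred)
        in_gt = point_in_polygon(x, y, gt)
        if in_pred and in_gt:
            inter += 1
        if in_pred or in_gt:
            union += 1
    # No sample fell inside either polygon (e.g. degenerate input).
    return inter / union if union else 0.0


if __name__ == "__main__":
    pred = [(0, 0), (2, 0), (2, 2), (0, 2)]
    gt = [(1, 0), (3, 0), (3, 2), (1, 2)]
    # The exact IoU of these two squares is 1/3; the estimate should be close.
    print(monte_carlo_polygon_iou(pred, gt))
```

A sampling estimate of this kind avoids exact polygon clipping, which is one plausible reason such a score could be computed quickly for arbitrary text shapes. Its accuracy depends on the number of samples, so the sample count trades speed against estimation noise.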