Arbitrary Shape Text Detection via Boundary Transformer

From arXiv: This is an extended version (TextBPN++) of our preliminary conference version, TextBPN (ICCV 2021), and has been accepted by IEEE Transactions on Multimedia (T-MM 2023). arXiv admin note: text overlap with arXiv:2107.12664

In arbitrary shape text detection, locating text boundaries accurately is challenging. Existing methods often suffer from indirect text boundary modeling or complex post-processing. In this paper, we systematically present a unified coarse-to-fine framework via boundary learning for arbitrary shape text detection, which can accurately and efficiently locate text boundaries without post-processing. In our method, we explicitly model the text boundary via an iterative boundary transformer in a coarse-to-fine manner. In this way, our method obtains accurate text boundaries directly and drops complex post-processing to improve efficiency. Specifically, our method mainly consists of a feature extraction backbone, a boundary proposal module, and an iteratively optimized boundary transformer module. The boundary proposal module, which consists of multi-layer dilated convolutions, computes important prior information (a classification map, a distance field, and a direction field) that is used both to generate coarse boundary proposals and to guide the boundary transformer's optimization. The boundary transformer module adopts an encoder-decoder structure, in which the encoder is built from multi-layer transformer blocks with residual connections and the decoder is a simple multi-layer perceptron (MLP). Under the guidance of the prior information, the boundary transformer module gradually refines the coarse boundary proposals via iterative boundary deformation. Furthermore, we propose a novel boundary energy loss (BEL), which introduces an energy minimization constraint and an energy monotonically decreasing constraint to further optimize and stabilize the learning of boundary refinement. Extensive experiments on publicly available and challenging datasets demonstrate the state-of-the-art performance and promising efficiency of our method.
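The iterative boundary deformation and the two constraints named for the boundary energy loss can be illustrated with a small sketch. This is not the paper's implementation: the abstract does not define the energy, so the function below uses a toy energy (mean squared distance from each boundary point to a matched target point), and a stand-in predictor that moves points halfway to the target takes the place of the learned boundary transformer. The names `refine_boundary`, `boundary_energy`, `boundary_energy_loss`, and `halfway_predictor` are hypothetical.

```python
def refine_boundary(points, predict_offsets, num_iters=3):
    """Deform a boundary num_iters times; return the boundary after every step."""
    history = [list(points)]
    for _ in range(num_iters):
        current = history[-1]
        offsets = predict_offsets(current)  # one (dx, dy) per control point
        history.append([(x + dx, y + dy)
                        for (x, y), (dx, dy) in zip(current, offsets)])
    return history


def boundary_energy(points, target):
    """Toy energy (assumption): mean squared distance to matched target points."""
    total = sum((x - tx) ** 2 + (y - ty) ** 2
                for (x, y), (tx, ty) in zip(points, target))
    return total / len(points)


def boundary_energy_loss(history, target):
    """Sum of two terms, mirroring the two constraints named in the abstract.

    minimization: energy of the final boundary should be small.
    monotonic:    hinge penalty whenever energy rises between iterations.
    The exact form and weighting of both terms are assumptions.
    """
    energies = [boundary_energy(b, target) for b in history]
    minimization = energies[-1]
    monotonic = sum(max(0.0, nxt - prev)
                    for prev, nxt in zip(energies, energies[1:]))
    return minimization + monotonic, energies


# Hypothetical data: a rectangular text boundary and a coarse proposal
# shifted 4 pixels to the right.
target = [(0.0, 0.0), (10.0, 0.0), (10.0, 4.0), (0.0, 4.0)]
coarse = [(x + 4.0, y) for x, y in target]


def halfway_predictor(points):
    """Stand-in for the learned module: step halfway toward the target."""
    return [((tx - x) * 0.5, (ty - y) * 0.5)
            for (x, y), (tx, ty) in zip(points, target)]


history = refine_boundary(coarse, halfway_predictor, num_iters=3)
loss, energies = boundary_energy_loss(history, target)
print(energies)  # [16.0, 4.0, 1.0, 0.25]
print(loss)      # 0.25
```

Because each step halves the remaining offset, the toy energy drops by a factor of four per iteration and the monotonic term stays at zero. A sequence whose energy rises at any step would add that increase to the loss, which is the stabilizing effect the monotonically decreasing constraint is described as providing.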
