Arbitrary Shape Text Detection via Boundary Transformer

From arXiv. This is an extended version (TextBPN++) of our preliminary conference paper TextBPN (ICCV 2021) [arXiv:2107.12664]; it has been accepted by IEEE Transactions on Multimedia (T-MM 2023).

In arbitrary shape text detection, locating accurate text boundaries is challenging and non-trivial. Existing methods often suffer from indirect text boundary modeling or complex post-processing. In this paper, we systematically present a unified coarse-to-fine framework via boundary learning for arbitrary shape text detection, which can accurately and efficiently locate text boundaries without post-processing. In our method, we explicitly model the text boundary via an innovative iterative boundary transformer in a coarse-to-fine manner. In this way, our method can directly gain accurate text boundaries and abandon complex post-processing to improve efficiency. Specifically, our method mainly consists of a feature extraction backbone, a boundary proposal module, and an iteratively optimized boundary transformer module. The boundary proposal module consisting of multi-layer dilated convolutions will compute important prior information (including classification map, distance field, and direction field) for generating coarse boundary proposals while guiding the boundary transformer's optimization. The boundary transformer module adopts an encoder-decoder structure, in which the encoder is constructed by multi-layer transformer blocks with residual connection while the decoder is a simple multi-layer perceptron network (MLP). Under the guidance of prior information, the boundary transformer module will gradually refine the coarse boundary proposals via iterative boundary deformation. Furthermore, we propose a novel boundary energy loss (BEL) which introduces an energy minimization constraint and an energy monotonically decreasing constraint to further optimize and stabilize the learning of boundary refinement. Extensive experiments on publicly available and challenging datasets demonstrate the state-of-the-art performance and promising efficiency of our method.
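To make the coarse-to-fine idea concrete, here is a minimal toy sketch of iterative boundary deformation with an energy that should fall at every step. It is not the authors' implementation. The abstract gives no formulas, so everything here is an assumption for illustration: the names `refine_boundary`, `make_toy_predictor`, `boundary_energy`, and `monotonic_penalty` are hypothetical, the learned encoder-decoder is replaced by a hand-written rule that pulls points toward a circle, and the hinge penalty is only one plausible way to express an "energy monotonically decreasing" constraint.

```python
import math


def refine_boundary(points, predict_offsets, num_iters=3):
    """Iteratively deform a boundary: each step adds per-point offsets.

    Returns the boundary after every iteration, starting with the input.
    """
    history = [list(points)]
    for _ in range(num_iters):
        current = history[-1]
        offsets = predict_offsets(current)
        history.append(
            [(x + dx, y + dy) for (x, y), (dx, dy) in zip(current, offsets)]
        )
    return history


def make_toy_predictor(center, radius, step=0.5):
    """Hypothetical stand-in for the learned boundary transformer.

    Moves each point a fraction `step` of the way toward a circle that
    plays the role of the true text boundary. Assumes no point sits
    exactly on `center`.
    """
    cx, cy = center

    def predict(points):
        offsets = []
        for x, y in points:
            d = math.hypot(x - cx, y - cy)
            scale = step * (radius - d) / d  # radial move toward the circle
            offsets.append(((x - cx) * scale, (y - cy) * scale))
        return offsets

    return predict


def boundary_energy(points, center, radius):
    """Assumed energy: mean absolute distance from the points to the circle."""
    cx, cy = center
    total = sum(abs(math.hypot(x - cx, y - cy) - radius) for x, y in points)
    return total / len(points)


def monotonic_penalty(energies):
    """Assumed hinge penalty: positive only when energy rises between steps."""
    return sum(max(0.0, nxt - cur) for cur, nxt in zip(energies, energies[1:]))


# Coarse proposal: eight points on a square around the origin.
coarse = [(2, 0), (2, 2), (0, 2), (-2, 2), (-2, 0), (-2, -2), (0, -2), (2, -2)]
center, radius = (0.0, 0.0), 1.0

history = refine_boundary(coarse, make_toy_predictor(center, radius), num_iters=3)
energies = [boundary_energy(b, center, radius) for b in history]
print(energies)
print(monotonic_penalty(energies))
```

In this toy setup each step halves every point's distance to the target circle, so the energy shrinks at every iteration and the penalty stays at zero. In the actual method the offsets would come from the transformer encoder and MLP decoder, guided by the classification map, distance field, and direction field.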
