We present PBFormer, an efficient yet powerful scene text detector that unifies the transformer with a novel text shape representation, the Polynomial Band (PB). The representation uses four polynomial curves to fit the top, bottom, left, and right sides of a text instance, and it can capture complex shapes by varying the polynomial coefficients. PB has two appealing features compared with conventional representations: 1) It models different curvatures with a fixed number of parameters, whereas polygon-point-based methods need a different number of points for different shapes. 2) It can distinguish adjacent or overlapping text instances because their curve coefficients are clearly different, whereas segmentation-based and point-based methods struggle when instances adhere to one another spatially. PBFormer combines PB with the transformer and directly generates smooth text contours by sampling the predicted curves, without interpolation. A parameter-free cross-scale pixel attention (CPA) module highlights the feature map of the suitable scale while suppressing the feature maps of the other scales. This simple operation helps detect small-scale text and is compatible with the one-stage DETR framework, which requires no NMS post-processing. Furthermore, PBFormer is trained with a shape-contained loss, which not only enforces piecewise alignment between the ground-truth and predicted curves but also makes the curves' positions and shapes consistent with each other. Without bells and whistles such as text pre-training, our method outperforms previous state-of-the-art text detectors on arbitrary-shaped text datasets.
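To make the contour-sampling idea concrete, the following is a minimal sketch of how a closed text contour could be read off four polynomial curves. It is not the PBFormer implementation: the abstract does not state the curve parameterization, the polynomial degree, or the sampling ranges. The sketch assumes that the top and bottom curves are functions y = f(x), that the left and right curves are functions x = g(y), and that the caller supplies the x and y extents. The names `sample_curve` and `pb_contour` are hypothetical.

```python
import numpy as np


def sample_curve(coeffs, start, stop, n):
    """Evaluate a polynomial at n evenly spaced inputs from start to stop.

    coeffs are ordered highest degree first, as expected by np.polyval.
    Returns (inputs, outputs).
    """
    t = np.linspace(start, stop, n)
    return t, np.polyval(coeffs, t)


def pb_contour(top, bottom, left, right, x_range, y_range, n=8):
    """Build a closed contour (4 * n points) from four polynomial curves.

    Assumed parameterization (not taken from the paper):
      top, bottom : y = f(x) for x in x_range
      left, right : x = g(y) for y in y_range
    Points are ordered top -> right -> bottom -> left.
    """
    x0, x1 = x_range
    y0, y1 = y_range
    tx, ty = sample_curve(top, x0, x1, n)      # left to right
    ry, rx = sample_curve(right, y0, y1, n)    # top to bottom
    bx, by = sample_curve(bottom, x1, x0, n)   # right to left
    ly, lx = sample_curve(left, y1, y0, n)     # bottom to top
    sides = [
        np.stack([tx, ty], axis=1),
        np.stack([rx, ry], axis=1),
        np.stack([bx, by], axis=1),
        np.stack([lx, ly], axis=1),
    ]
    return np.concatenate(sides, axis=0)


# A straight, axis-aligned box expressed with constant polynomials.
box = pb_contour(
    top=[0.0, 0.0, 10.0], bottom=[0.0, 0.0, 20.0],
    left=[0.0, 0.0, 5.0], right=[0.0, 0.0, 50.0],
    x_range=(5.0, 50.0), y_range=(10.0, 20.0), n=8,
)

# A curved instance uses the same number of coefficients; only values change.
curved = pb_contour(
    top=[0.01, -0.5, 16.0], bottom=[0.01, -0.5, 26.0],
    left=[0.0, 0.0, 5.0], right=[0.0, 0.0, 50.0],
    x_range=(5.0, 50.0), y_range=(10.0, 20.0), n=16,
)
```

Under these assumptions, the number of curve parameters stays fixed regardless of curvature, and raising `n` yields a denser contour without any interpolation between predicted points.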