Recent Transformer-based text detectors predict polygons by encoding the coordinates of each boundary vertex with a distinct query feature. However, this approach incurs significant memory overhead and struggles to capture the relationships between vertices belonging to the same instance. Consequently, irregular text layouts often lead to the prediction of outlier vertices, degrading the quality of results. To address these challenges, we present an approach rooted in Sparse R-CNN: a cascade decoding pipeline for polygon prediction. Our method iteratively refines polygon predictions, taking into account both the scale and location of the preceding results. With this stabilized regression pipeline, even a single feature vector guiding polygon instance regression yields promising detection results. At the same time, using instance-level feature proposals substantially reduces memory usage (>50% less than the state-of-the-art method DPText-DETR) and inference time (>40% less than DPText-DETR), with a minor performance drop on benchmarks.
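The iterative refinement described above can be illustrated with a minimal sketch. It is not the paper's implementation: the function names `refine_polygon` and `cascade_decode`, the use of the bounding-box width and height as the "scale", and the per-vertex offset parameterization are all assumptions made for illustration. The per-stage offsets stand in for whatever a decoder stage would predict.

```python
import numpy as np


def refine_polygon(polygon, deltas):
    """One hypothetical refinement step.

    polygon: (N, 2) array of (x, y) vertices from the preceding stage.
    deltas:  (N, 2) array of scale-normalized offsets (assumed decoder output).
    """
    polygon = np.asarray(polygon, dtype=float)
    deltas = np.asarray(deltas, dtype=float)
    # Scale of the preceding result: width and height of its bounding box.
    size = polygon.max(axis=0) - polygon.min(axis=0)
    size = np.maximum(size, 1e-6)  # guard against degenerate polygons
    # Location of the preceding result: offsets are applied relative to
    # the previous vertex positions, scaled by the previous size.
    return polygon + deltas * size


def cascade_decode(initial_polygon, stage_deltas):
    """Apply the refinement step once per stage, feeding each result forward."""
    polygon = np.asarray(initial_polygon, dtype=float)
    for deltas in stage_deltas:
        polygon = refine_polygon(polygon, deltas)
    return polygon
```

Normalizing offsets by the size of the previous estimate keeps the regression targets in a similar range for small and large text instances, which is one plausible reading of the "stabilized" regression mentioned above.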