Existing text detection techniques fall broadly into two groups: segmentation-based methods and regression-based methods. Segmentation-based methods are more robust to font variations but require intricate post-processing, which incurs high computational overhead. Regression-based methods make instance-aware predictions but are limited in robustness and data efficiency because they rely on high-level representations. In this work, we propose SRFormer, a unified DETR-based model that combines segmentation and regression, aiming to exploit both the inherent robustness of segmentation representations and the straightforward post-processing of instance-level regression. Our empirical analysis indicates that good segmentation predictions can already be obtained at the initial decoder layers. We therefore restrict the segmentation branches to the first few decoder layers and apply progressive regression refinement in the subsequent layers, achieving performance gains while minimizing the additional computational load from the mask. Furthermore, we propose a Mask-informed Query Enhancement module, which treats the segmentation result as a natural soft-ROI to pool and extract robust pixel representations; these are then used to enhance and diversify the instance queries. Extensive experiments across multiple benchmarks demonstrate our method's robustness, its training and data efficiency, and its state-of-the-art performance.
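To make the soft-ROI idea concrete, the following is a minimal sketch of mask-weighted pooling, not the SRFormer implementation. It assumes that pixel features are flattened into a list of vectors, that the segmentation output for one instance is a per-pixel probability in [0, 1], and that the pooled vector is added to the instance query as a residual. The function names `soft_roi_pool` and `enhance_query` are hypothetical, and the actual module may fuse the pooled features differently.

```python
def soft_roi_pool(features, mask_probs, eps=1e-6):
    """Average per-pixel features, weighted by soft mask probabilities.

    features:   list of N pixel feature vectors, each of length C
    mask_probs: list of N mask values in [0, 1] for a single instance
    eps:        small constant that avoids division by zero for empty masks
    """
    channels = len(features[0])
    total = sum(mask_probs) + eps
    pooled = [0.0] * channels
    for feat, weight in zip(features, mask_probs):
        for c in range(channels):
            pooled[c] += weight * feat[c]
    return [value / total for value in pooled]


def enhance_query(query, features, mask_probs):
    """Assumed fusion rule: add the pooled pixel vector to the query."""
    pooled = soft_roi_pool(features, mask_probs)
    return [q + p for q, p in zip(query, pooled)]


# Three pixels with two channels; the mask selects the first two pixels.
features = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
mask_probs = [1.0, 1.0, 0.0]
pooled = soft_roi_pool(features, mask_probs)
enhanced = enhance_query([1.0, 2.0], features, mask_probs)
```

Because the mask acts as a weight rather than a hard crop, pixels with low segmentation confidence contribute little to the pooled vector, which is the sense in which the segmentation result serves as a soft ROI.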