Scene text varies widely in font, color, shape, and size, so detecting it accurately and efficiently remains a formidable challenge. Among existing detection approaches, segmentation-based methods have become prominent owing to their flexible pixel-level predictions. However, these methods typically model text instances in a bottom-up manner, which is highly susceptible to noise. In addition, each pixel is predicted in isolation, without pixel-feature interaction, which also limits detection performance. To alleviate these problems, we propose a multi-information-level arbitrary-shaped text detector consisting of a focus entirety module (FEM) and a perceive environment module (PEM). The FEM extracts instance-level features and adopts a top-down scheme to model texts, reducing the influence of noise. Specifically, it assigns consistent entirety information to pixels within the same instance to improve their cohesion. It also emphasizes scale information, enabling the model to distinguish texts of varying scales effectively. The PEM extracts region-level information and encourages the model to focus on the distribution of positive samples in the vicinity of each pixel, thereby perceiving environment information. It treats kernel pixels as positive samples and helps the model differentiate text and kernel features. Extensive experiments demonstrate that the FEM efficiently supports the model in handling texts of different scales and confirm that the PEM helps the model perceive pixels more accurately by focusing on their vicinities. Comparisons show that the proposed model outperforms existing state-of-the-art approaches on four public datasets.
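To make the idea of "the distribution of positive samples in the vicinity of a pixel" more concrete, the following is a minimal sketch of one possible region-level statistic: the fraction of kernel pixels inside a square window around each pixel. It is not the PEM described above, whose design the abstract does not specify. The function name `kernel_density`, the `radius` parameter, and the choice of a square window are all assumptions made for illustration, and plain Python lists are used only to keep the example dependency-free.

```python
def kernel_density(kernel_mask, radius=1):
    """Fraction of kernel (positive) pixels around each pixel.

    kernel_mask: 2D list of 0/1 values, 1 marking a kernel pixel.
    radius: hypothetical window half-size; the window is
            (2 * radius + 1) squared and is clipped at image borders.
    """
    h = len(kernel_mask)
    w = len(kernel_mask[0]) if h else 0

    # Summed-area table with an extra zero row and column, so any
    # rectangular sum can be read with four lookups.
    sat = [[0] * (w + 1) for _ in range(h + 1)]
    for y in range(h):
        row_sum = 0
        for x in range(w):
            row_sum += 1 if kernel_mask[y][x] else 0
            sat[y + 1][x + 1] = sat[y][x + 1] + row_sum

    density = [[0.0] * w for _ in range(h)]
    for y in range(h):
        y0, y1 = max(0, y - radius), min(h, y + radius + 1)
        for x in range(w):
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            positives = (sat[y1][x1] - sat[y0][x1]
                         - sat[y1][x0] + sat[y0][x0])
            # Normalize by the number of pixels actually in the window.
            density[y][x] = positives / ((y1 - y0) * (x1 - x0))
    return density


if __name__ == "__main__":
    mask = [
        [0, 0, 0],
        [0, 1, 0],
        [0, 0, 0],
    ]
    for row in kernel_density(mask, radius=1):
        print(["%.3f" % v for v in row])
```

In a statistic of this kind, a pixel deep inside a kernel gets a value near 1, while a pixel in the text border or background gets a lower value. That is one simple way in which neighborhood information could help separate text features from kernel features.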