Text segmentation has a wide range of applications, such as image editing, style transfer, and watermark removal. However, the pixel-level labels in existing public datasets are of poor quality, and such labels are notoriously costly to acquire in terms of both money and time. Meanwhile, when models are pretrained on synthetic datasets, the synthetic data distribution is far from that of real scenes. Both issues pose a major challenge to current pixel-level text segmentation algorithms. To alleviate these problems, we propose a self-supervised scene text segmentation algorithm that uses layered decoupling of representations, derived in an object-centric manner, to segment images into text and background. Our method introduces two novel designs, a Region Query Module and Representation Consistency Constraints, which adapt to the unique properties of text and complement the autoencoder, improving the network's sensitivity to text. Under this design, we treat the polygon-level masks predicted by a text localization model as extra input, and we neither use any pixel-level mask annotations during training nor pretrain on synthetic datasets. Extensive experiments demonstrate the effectiveness of the proposed method. On several public scene text datasets, our method outperforms state-of-the-art unsupervised segmentation algorithms.