In text recognition, self-supervised pre-training has emerged as an effective way to reduce dependence on expensive annotated real data. Previous studies primarily focus on local visual representations, leveraging masked image modeling or sequence contrastive learning; however, they neglect the linguistic information in text images, which is crucial for recognizing text. To simultaneously capture local character features and linguistic information in the visual space, we propose Symmetric Superimposition Modeling (SSM). The objective of SSM is to reconstruct direction-specific pixel and feature signals from symmetrically superimposed inputs. Specifically, we add the original image to its inverted view to create the symmetrically superimposed input. At the pixel level, we reconstruct both the original and inverted images to capture character shapes and texture-level linguistic context. At the feature level, we reconstruct the features of the original and inverted images under different augmentations to model semantic-level linguistic context and local character discrimination. Because the superimposition disrupts both character shapes and linguistic rules, this dual-level reconstruction encourages understanding character shapes and linguistic information from the perspectives of visual texture and feature semantics. Experiments on various text recognition benchmarks demonstrate the effectiveness and generality of SSM, with 4.1% average performance gains and a new state-of-the-art average word accuracy of 86.6% on the Union14M benchmarks.
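The symmetric superimposition step described above can be sketched as a simple pixel-wise mix of an image with its 180-degree inverted view. This is a minimal illustration using NumPy, not the paper's implementation; the equal 0.5/0.5 mixing weights and the choice of a full 180-degree flip as the "inverted view" are assumptions for this sketch.

```python
import numpy as np

def symmetric_superimpose(img: np.ndarray) -> np.ndarray:
    """Superimpose an image onto its inverted (180-degree rotated) view.

    Hypothetical helper illustrating the idea of a symmetrically
    superimposed input; the exact mixing scheme in SSM is not
    specified here, so equal weights are an assumption.
    """
    inverted = img[::-1, ::-1]           # flip both axes: the inverted view
    return 0.5 * img + 0.5 * inverted    # pixel-wise superimposition

# Toy grayscale "text image" (4x6) to illustrate the symmetry.
img = np.arange(24, dtype=float).reshape(4, 6)
mix = symmetric_superimpose(img)

# By construction, the superimposed input is invariant to inversion,
# which is why the model must disentangle direction-specific signals
# to reconstruct the original and inverted targets.
assert np.allclose(mix, mix[::-1, ::-1])
```

Because the mixed input is identical to its own inverted view, reconstructing the original and the inverted image separately forces the model to rely on character shapes and linguistic context rather than trivial pixel cues.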