Pre-trained vision-language models~(VLMs) are the de facto foundation models for various downstream tasks. However, scene text recognition~(STR) methods still prefer backbones pre-trained on a single modality, namely the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify both regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. Motivated by this, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. We scale CLIP4STR in terms of model size, pre-training data, and training data, achieving state-of-the-art performance on 11 STR benchmarks. Additionally, we provide a comprehensive empirical study to clarify how CLIP adapts to STR. We believe our method establishes a simple yet strong baseline for future STR research with VLMs.
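The two-branch data flow described in the abstract can be illustrated with a minimal sketch. This is not the CLIP4STR implementation: the function `predict_and_refine`, the four callables it takes, and the toy stand-ins below (including the tiny `LEXICON`) are all hypothetical names introduced here for illustration. The sketch only shows the ordering of steps: the visual branch produces an initial prediction from the visual feature, and the cross-modal branch refines it using both the visual feature and a text-side encoding of that prediction.

```python
from typing import Callable, List, Tuple

# Hypothetical toy vocabulary standing in for "text semantics".
LEXICON = ["hello", "world", "clips"]


def predict_and_refine(
    image: str,
    visual_encoder: Callable[[str], List[str]],
    visual_decoder: Callable[[List[str]], str],
    text_encoder: Callable[[str], List[str]],
    cross_modal_decoder: Callable[[List[str], List[str]], str],
) -> Tuple[str, str]:
    """Return (initial prediction, refined prediction)."""
    # Visual branch: encode the image and decode an initial prediction.
    visual_feature = visual_encoder(image)
    initial = visual_decoder(visual_feature)
    # Cross-modal branch: encode the initial prediction as text, then
    # decode again with access to both visual and text features.
    text_feature = text_encoder(initial)
    refined = cross_modal_decoder(visual_feature, text_feature)
    return initial, refined


# --- Toy stand-ins (assumptions, not real CLIP components) ---

def toy_visual_encoder(image: str) -> List[str]:
    # Pretend the "image" is a string of noisy glyphs; one feature per glyph.
    return list(image)


def toy_visual_decoder(visual_feature: List[str]) -> str:
    # Read the glyphs off directly, noise included.
    return "".join(visual_feature)


def toy_text_encoder(text: str) -> List[str]:
    # "Text semantics" here is just the set of known words of the same length.
    return [word for word in LEXICON if len(word) == len(text)]


def toy_cross_modal_decoder(visual_feature: List[str], text_feature: List[str]) -> str:
    # Pick the known word that agrees with the most glyphs; fall back to
    # the raw visual reading if no candidate word exists.
    if not text_feature:
        return "".join(visual_feature)

    def agreement(word: str) -> int:
        return sum(a == b for a, b in zip(word, visual_feature))

    return max(text_feature, key=agreement)


if __name__ == "__main__":
    initial, refined = predict_and_refine(
        "he1lo",
        toy_visual_encoder,
        toy_visual_decoder,
        toy_text_encoder,
        toy_cross_modal_decoder,
    )
    print(initial, "->", refined)
```

In this toy run the visual branch reads the noisy input literally, and the cross-modal step replaces the reading with the lexicon word that best matches the glyphs. Passing the encoders and decoders as callables keeps the control flow separate from any particular model, so real components could be substituted without changing `predict_and_refine`.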