Pre-trained vision-language models (VLMs) are the de facto foundation models for various downstream tasks. However, scene text recognition (STR) methods still prefer backbones pre-trained on a single modality, namely the visual modality, despite the potential of VLMs to serve as powerful scene text readers. For example, CLIP can robustly identify both regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in images. Motivated by this, we transform CLIP into a scene text reader and introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. CLIP4STR achieves new state-of-the-art performance on 11 STR benchmarks. Additionally, we provide a comprehensive empirical study to improve the understanding of how CLIP adapts to STR. We believe our method establishes a simple but strong baseline for future STR research with VLMs.
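The two-branch data flow described above (a visual branch that produces an initial prediction, followed by a cross-modal branch that refines it using both the visual feature and the text semantics) can be illustrated with a minimal toy sketch. This is not the CLIP4STR implementation: every name here (`predict_and_refine`, the `toy_*` functions, `LEXICON`) is hypothetical, strings stand in for real feature tensors, and the learned decoders are replaced by hand-written rules purely to show the order of operations.

```python
from typing import Callable, Tuple

# Hypothetical tiny vocabulary standing in for learned text semantics.
LEXICON = ("hello", "world", "clip")


def predict_and_refine(
    visual_feature: str,
    visual_decoder: Callable[[str], str],
    text_encoder: Callable[[str], str],
    cross_modal_decoder: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Run the visual branch first, then let the cross-modal branch refine it."""
    # Visual branch: initial prediction from the visual feature alone.
    initial = visual_decoder(visual_feature)
    # Cross-modal branch: encode the initial prediction as text ...
    text_feature = text_encoder(initial)
    # ... and refine it using both the visual feature and the text feature.
    refined = cross_modal_decoder(visual_feature, text_feature)
    return initial, refined


def toy_visual_decoder(visual_feature: str) -> str:
    # Assumption for illustration: the visual branch confuses the first 'l' with '1'.
    return visual_feature.replace("l", "1", 1)


def toy_text_encoder(text: str) -> str:
    # Stand-in for a text encoder: simply normalizes case.
    return text.lower()


def toy_cross_modal_decoder(visual_feature: str, text_feature: str) -> str:
    # Use the visual feature as a length constraint on candidate words.
    candidates = [w for w in LEXICON if len(w) == len(visual_feature)]
    if not candidates:
        # Nothing plausible in the lexicon: keep the initial prediction.
        return text_feature
    # Pick the candidate with the fewest character mismatches to the prediction.
    return min(candidates, key=lambda w: sum(a != b for a, b in zip(w, text_feature)))


initial, refined = predict_and_refine(
    "hello", toy_visual_decoder, toy_text_encoder, toy_cross_modal_decoder
)
print(initial, refined)
```

In this sketch the refinement step resolves a character-level error that the visual branch alone cannot correct, which is the role the abstract assigns to the cross-modal branch; the actual model would use learned decoders over CLIP features rather than a fixed lexicon.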