Pre-trained vision-language models are the de facto foundation models for various downstream tasks. However, this trend has not extended to the field of scene text recognition (STR), despite the potential of CLIP to serve as a powerful scene text reader. CLIP can robustly identify regular (horizontal) and irregular (rotated, curved, blurred, or occluded) text in natural images. Building on these merits, we introduce CLIP4STR, a simple yet effective STR method built upon the image and text encoders of CLIP. It has two encoder-decoder branches: a visual branch and a cross-modal branch. The visual branch provides an initial prediction based on the visual feature, and the cross-modal branch refines this prediction by addressing the discrepancy between the visual feature and text semantics. To fully leverage the capabilities of both branches, we design a dual predict-and-refine decoding scheme for inference. CLIP4STR achieves new state-of-the-art performance on 11 STR benchmarks. Additionally, we provide a comprehensive empirical study to improve the understanding of how CLIP adapts to STR. We believe our method establishes a simple but strong baseline for future STR research with vision-language (VL) models.
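The data flow between the two branches can be illustrated with a minimal sketch. It is not the CLIP4STR implementation: it shows only a single predict-then-refine pass as described in the abstract, and it does not model the full dual decoding scheme. The function `predict_and_refine`, the four callables it takes, the toy encoders and decoders, and `TOY_LEXICON` are all hypothetical stand-ins introduced for illustration; in the actual method the encoders would be CLIP's image and text encoders and the decoders would be learned modules.

```python
from typing import Callable, Sequence, Tuple


def predict_and_refine(
    image,
    visual_encode: Callable,
    visual_decode: Callable,
    text_encode: Callable,
    cross_modal_decode: Callable,
) -> Tuple[str, str]:
    """Run one predict-then-refine pass (hypothetical interface)."""
    # Visual branch: encode the image and decode an initial prediction.
    visual_feature = visual_encode(image)
    initial = visual_decode(visual_feature)
    # Cross-modal branch: encode the initial text, then refine it using
    # both the visual feature and the text feature.
    text_feature = text_encode(initial)
    refined = cross_modal_decode(visual_feature, text_feature)
    return initial, refined


# --- Toy stand-ins (assumptions, not real models) ---
TOY_LEXICON = ("hello", "world")


def toy_visual_encode(image: str) -> Sequence[str]:
    # Pretend each character of the string is one visual token.
    return list(image)


def toy_visual_decode(visual_feature: Sequence[str]) -> str:
    # Read the tokens back verbatim, including any "misread" characters.
    return "".join(visual_feature)


def toy_text_encode(text: str) -> str:
    # Stand-in for a text encoder: normalise case only.
    return text.lower()


def toy_cross_modal_decode(visual_feature: Sequence[str], text_feature: str) -> str:
    # Use the visual feature for a length constraint and the text feature
    # for a character-level match against a tiny lexicon.
    candidates = [w for w in TOY_LEXICON if len(w) == len(visual_feature)]
    if not candidates:
        return text_feature  # nothing to refine against
    return min(
        candidates,
        key=lambda w: sum(a != b for a, b in zip(w, text_feature)),
    )


initial, refined = predict_and_refine(
    "he1lo",
    toy_visual_encode,
    toy_visual_decode,
    toy_text_encode,
    toy_cross_modal_decode,
)
print(initial, refined)  # he1lo hello
```

In this toy run the visual branch reads the digit `1` literally, and the cross-modal step corrects it because the text semantics (here, a two-word lexicon) disagree with the purely visual reading.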