Employing a dictionary can efficiently rectify deviations between the visual prediction and the ground truth in scene text recognition methods. However, because the dictionary is independent of the visual features, it may incorrectly "rectify" visual predictions that were already accurate. In this paper, we propose a new dictionary language model built on the Scene Image-Text Matching (SITM) network, which avoids drawbacks of the explicit dictionary language model such as 1) its independence from the visual features and 2) noisy choices among candidates. The SITM network accomplishes this by using Image-Text Contrastive (ITC) learning to match an image with its corresponding text among the candidates at inference time. ITC is widely used in vision-language learning to pull positive image-text pairs closer together in feature space. Inspired by ITC, the SITM network combines the visual features with the text features of all candidates and selects the candidate at the minimum distance in the feature space. Our lexicon method achieves 93.8\% accuracy on six mainstream benchmarks, compared with 92.1\% for the ordinary method. Additionally, we integrate our method with ABINet and establish new state-of-the-art results on several benchmarks.
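The candidate-selection step described above can be illustrated with a minimal sketch. This is not the SITM network itself: the abstract does not specify the encoders or the distance measure, so the example assumes cosine similarity (for normalized features, the highest similarity corresponds to the minimum distance) and uses hand-made toy vectors in place of learned image and text features. The names `cosine_similarity` and `select_candidate` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def select_candidate(
    image_feature: Sequence[float],
    candidates: Sequence[str],
    text_features: Sequence[Sequence[float]],
) -> Tuple[str, List[float]]:
    """Pick the dictionary candidate whose text feature is closest to the image feature.

    Assumption: closeness is measured by cosine similarity. The paper's
    actual matching function may differ.
    """
    scores = [cosine_similarity(image_feature, t) for t in text_features]
    best_index = max(range(len(candidates)), key=scores.__getitem__)
    return candidates[best_index], scores


# Toy data: placeholders for the outputs of an image encoder and a text encoder.
image_feature = [0.9, 0.1, 0.0]
candidates = ["hello", "hallo", "hells"]
text_features = [
    [1.0, 0.0, 0.0],  # "hello"
    [0.0, 1.0, 0.0],  # "hallo"
    [0.5, 0.5, 0.0],  # "hells"
]

best, scores = select_candidate(image_feature, candidates, text_features)
print(best)
```

Unlike an explicit dictionary lookup based only on edit distance to the predicted string, the selection here depends on the image feature, which is the property the abstract highlights.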