This paper re-assesses scene text recognition (STR) from a data-oriented perspective. We begin by revisiting the six commonly used STR benchmarks and observe a trend of performance saturation: only 2.91% of the benchmark images cannot be accurately recognized by an ensemble of 13 representative models. While these results are impressive and suggest that STR could be considered solved, we argue that they primarily reflect the less challenging nature of the common benchmarks, which conceals the underlying issues that STR faces. To this end, we consolidate a large-scale real STR dataset, namely Union14M, which comprises 4 million labeled images and 10 million unlabeled images, to assess the performance of STR models in more complex real-world scenarios. Our experiments demonstrate that the 13 models achieve an average accuracy of only 66.53% on the 4 million labeled images, indicating that STR still faces numerous challenges in the real world. By analyzing the error patterns of the 13 models, we identify seven open challenges in STR and develop a challenge-driven benchmark consisting of eight distinct subsets to facilitate further progress in the field. Our exploration demonstrates that STR is far from being solved and that leveraging data may be a promising solution. In this regard, we find that utilizing the 10 million unlabeled images through self-supervised pre-training can significantly improve the robustness of STR models in real-world scenarios and lead to state-of-the-art performance.
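The two headline statistics in the abstract, the share of images that no model in the ensemble recognizes and the average per-model accuracy, can be illustrated with a minimal Python sketch. It assumes exact string match as the correctness criterion; the paper's actual evaluation protocol (for example, case or punctuation normalization) is not stated here and may differ. The function names and the toy predictions are hypothetical and serve only as illustration.

```python
from typing import Dict, List


def word_accuracy(preds: List[str], labels: List[str]) -> float:
    """Fraction of images whose predicted string equals the label exactly.

    Assumption: exact match, with no case or punctuation normalization.
    """
    return sum(p == t for p, t in zip(preds, labels)) / len(labels)


def average_accuracy(all_preds: Dict[str, List[str]], labels: List[str]) -> float:
    """Mean of the per-model word accuracies."""
    return sum(word_accuracy(p, labels) for p in all_preds.values()) / len(all_preds)


def ensemble_failure_rate(all_preds: Dict[str, List[str]], labels: List[str]) -> float:
    """Fraction of images that no model in the ensemble recognizes correctly."""
    failures = 0
    for i, target in enumerate(labels):
        # An image counts as a failure only if every model gets it wrong.
        if not any(preds[i] == target for preds in all_preds.values()):
            failures += 1
    return failures / len(labels)


# Hypothetical toy data: four images and two models (not from the paper).
labels = ["stop", "exit", "cafe", "hotel"]
all_preds = {
    "model_a": ["stop", "exit", "cafe", "hotei"],
    "model_b": ["stop", "exlt", "cafe", "hote1"],
}

failure_rate = ensemble_failure_rate(all_preds, labels)
mean_accuracy = average_accuracy(all_preds, labels)
```

In this toy case only the last image is missed by both models, so the ensemble failure rate is 0.25, while the average accuracy is 0.625. The gap between the two numbers shows why a low ensemble failure rate on a benchmark does not imply that any single model is near-perfect.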