Revisiting Scene Text Recognition: A Data Perspective

This paper aims to re-assess scene text recognition (STR) from a data-oriented perspective. We begin by revisiting the six commonly used benchmarks in STR and observe a trend of performance saturation: only 2.91% of the benchmark images cannot be accurately recognized by an ensemble of 13 representative models. While these results are impressive and suggest that STR could be considered solved, we argue that this is primarily due to the less challenging nature of the common benchmarks, which conceals the underlying issues that STR faces. To this end, we consolidate a large-scale real STR dataset, namely Union14M, which comprises 4 million labeled images and 10 million unlabeled images, to assess the performance of STR models in more complex real-world scenarios. Our experiments demonstrate that the 13 models achieve an average accuracy of only 66.53% on the 4 million labeled images, indicating that STR still faces numerous challenges in the real world. By analyzing the error patterns of the 13 models, we identify seven open challenges in STR and develop a challenge-driven benchmark consisting of eight distinct subsets to facilitate further progress in the field. Our exploration demonstrates that STR is far from being solved and that leveraging data may be a promising solution. In this regard, we find that utilizing the 10 million unlabeled images through self-supervised pre-training can significantly improve the robustness of STR models in real-world scenarios and lead to state-of-the-art performance.
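The two statistics quoted in the abstract can be illustrated with a small sketch. It assumes one plausible reading of the ensemble figure: an image counts as unrecognized only if none of the models transcribes it correctly. The function names, the exact-match comparison, and the toy data are all hypothetical and are not taken from the paper, which may normalize case or symbols before comparing.

```python
from typing import Dict, List


def unrecognized_fraction(predictions: Dict[str, List[str]], labels: List[str]) -> float:
    """Fraction of images that no model transcribes exactly (assumed reading)."""
    if not labels:
        raise ValueError("labels must not be empty")
    missed = 0
    for i, label in enumerate(labels):
        # An image is "recognized" if at least one model gets it right.
        if not any(preds[i] == label for preds in predictions.values()):
            missed += 1
    return missed / len(labels)


def average_accuracy(predictions: Dict[str, List[str]], labels: List[str]) -> float:
    """Mean of the per-model word accuracies (exact match, an assumption)."""
    if not labels or not predictions:
        raise ValueError("labels and predictions must not be empty")
    per_model = [
        sum(pred == label for pred, label in zip(preds, labels)) / len(labels)
        for preds in predictions.values()
    ]
    return sum(per_model) / len(per_model)


# Toy data: two hypothetical models on four word images.
labels = ["open", "exit", "cafe", "stop"]
predictions = {
    "model_a": ["open", "exit", "cafe", "st0p"],
    "model_b": ["0pen", "exit", "cafe", "shop"],
}

print(unrecognized_fraction(predictions, labels))  # 0.25: only "stop" is missed by both
print(average_accuracy(predictions, labels))       # 0.625: mean of 0.75 and 0.5
```

Under this reading, the ensemble figure can be far lower than the error rate of any single model, because the models need not fail on the same images.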

