Scene text spotting is the task of detecting text regions in natural scene images and, at the same time, recognizing the characters they contain. It has attracted much attention in recent years because of its wide range of applications. Existing research has focused mainly on improving text region detection rather than text recognition. As a result, detection accuracy has improved, but end-to-end accuracy remains insufficient. Text in natural scene images tends to be not a random string of characters but a meaningful one, that is, a word. We therefore propose adversarial learning of semantic representations for scene text spotting (A3S) to improve end-to-end accuracy, including text recognition. Rather than performing text recognition from existing visual features alone, A3S simultaneously predicts semantic features in the detected text area. Experimental results on publicly available datasets show that the proposed method achieves better accuracy than other methods.