Text recognition in the wild is a long-standing problem in computer vision. Driven by end-to-end deep learning, recent studies suggest that vision and language processing are effective for scene text recognition. Yet correcting edit errors, such as added, deleted, or replaced characters, remains the main challenge for existing approaches. In fact, the content of a text and its audio naturally correspond to each other: a single-character error may result in a clearly different pronunciation. In this paper, we propose AudioOCR, a simple yet effective probabilistic audio decoder that predicts mel spectrogram sequences to guide scene text recognition. The decoder participates only in the training phase and adds no extra cost at inference. The underlying principle of AudioOCR can be easily applied to existing approaches. Experiments with 7 previous scene text recognition methods on 12 existing regular, irregular, and occluded benchmarks demonstrate that the proposed method brings consistent improvement. More importantly, our experiments show that AudioOCR generalizes to more challenging scenarios, including recognizing non-English text, out-of-vocabulary words, and text with various accents. Code will be available at https://github.com/wenwenyu/AudioOCR.
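To make the three edit-error types named in the abstract concrete, the following is a minimal sketch that classifies the character-level differences between a predicted string and its ground truth using a standard Levenshtein alignment. It is not part of AudioOCR: the function name `edit_ops`, the result keys, and the example words are illustrative assumptions, and errors are named from the recognizer's point of view (an "added" character is an extra character in the prediction).

```python
def edit_ops(pred: str, target: str) -> dict:
    """Classify the edit errors in `pred` relative to `target`.

    Hypothetical helper for illustration only. Returns counts of
    characters the recognizer added, deleted, or replaced, taken from
    one optimal Levenshtein alignment.
    """
    m, n = len(pred), len(target)
    # dist[i][j] = edit distance between pred[:i] and target[:j]
    dist = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dist[i][0] = i
    for j in range(n + 1):
        dist[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            dist[i][j] = min(
                dist[i - 1][j] + 1,         # extra character in pred
                dist[i][j - 1] + 1,         # character missing from pred
                dist[i - 1][j - 1] + cost,  # match or substitution
            )

    # Walk back from the bottom-right corner to count each error type.
    ops = {"added": 0, "deleted": 0, "replaced": 0}
    i, j = m, n
    while i > 0 or j > 0:
        if (i > 0 and j > 0 and pred[i - 1] == target[j - 1]
                and dist[i][j] == dist[i - 1][j - 1]):
            i, j = i - 1, j - 1             # characters agree
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops["replaced"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops["added"] += 1               # pred has an extra character
            i -= 1
        else:
            ops["deleted"] += 1             # pred is missing a character
            j -= 1
    return ops


if __name__ == "__main__":
    for pred, target in [("horse", "house"), ("scenee", "scene"), ("sene", "scene")]:
        print(pred, target, edit_ops(pred, target))
```

Each example pair differs by a single character, which is the kind of error the abstract argues should produce a clearly different pronunciation and therefore a useful training signal from the audio branch.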