Multi-modal models have recently shown appealing performance in visual recognition tasks, as free-form text-guided training evokes the ability to understand fine-grained visual content. However, current models are either inefficient or cannot be trivially extended to scene text recognition (STR) because natural images and text images differ in composition. We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem and understands text images by predicting character attributes, e.g., character frequency and position. IGTR first devises $\left \langle condition,question,answer\right \rangle$ instruction triplets, which provide rich and diverse descriptions of character attributes. To learn these attributes effectively through question answering, IGTR develops a lightweight instruction encoder, a cross-modal feature fusion module, and a multi-task answer head, which together guide nuanced text image understanding. Furthermore, IGTR realizes different recognition pipelines simply by using different instructions, enabling a character-understanding-based text reasoning paradigm that differs considerably from current methods. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins while maintaining a small model size and fast inference. Moreover, by adjusting the sampling of instructions, IGTR offers an elegant way to recognize both rarely appearing and morphologically similar characters, which have previously been challenging. Code is available at \href{https://github.com/Topdu/OpenOCR}{https://github.com/Topdu/OpenOCR}.
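To make the idea of $\left \langle condition,question,answer\right \rangle$ triplets more concrete, the following is a minimal sketch of how such triplets about character frequency and position might be derived from a ground-truth label. It is an illustration only and does not reproduce the instruction format used in the IGTR code: the function name `build_triplets`, the `known_positions` parameter, and the question wording are all assumptions introduced here.

```python
from collections import Counter


def build_triplets(label, known_positions=()):
    """Hypothetical sketch: derive (condition, question, answer) triplets
    from a ground-truth text label. Names and question templates are
    assumptions for illustration, not the paper's actual format."""
    # Condition: characters assumed to be already known, as (index, char) pairs.
    known = sorted(set(known_positions))
    condition = [(i, label[i]) for i in known]

    triplets = []

    # Frequency attribute: how often each distinct character occurs.
    for ch, count in sorted(Counter(label).items()):
        question = f"How many times does '{ch}' appear?"
        triplets.append((condition, question, count))

    # Position attribute: which character sits at each unknown index.
    for i, ch in enumerate(label):
        if i not in known:
            question = f"What is the character at position {i}?"
            triplets.append((condition, question, ch))

    return triplets


if __name__ == "__main__":
    for cond, q, a in build_triplets("hello", known_positions=[0]):
        print(cond, "|", q, "|", a)
```

In this sketch, changing which question types are emitted, or how often a given character is asked about, stands in for the instruction sampling the abstract mentions. The actual sampling strategy is not specified here.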