Instruction-Guided Scene Text Recognition

Multi-modal models have recently shown appealing performance in visual recognition tasks, as free-form text-guided training evokes the ability to understand fine-grained visual content. However, current models are either inefficient or cannot be trivially adapted to scene text recognition (STR) because natural images and text images differ in composition. We propose a novel instruction-guided scene text recognition (IGTR) paradigm that formulates STR as an instruction learning problem and understands text images by predicting character attributes, e.g., character frequency and position. IGTR first devises $\left \langle condition,question,answer\right \rangle$ instruction triplets, which provide rich and diverse descriptions of character attributes. To learn these attributes effectively through question answering, IGTR develops a lightweight instruction encoder, a cross-modal feature fusion module, and a multi-task answer head, which together guide nuanced text image understanding. Furthermore, IGTR realizes different recognition pipelines simply by using different instructions, enabling a character-understanding-based text reasoning paradigm that differs considerably from current methods. Experiments on English and Chinese benchmarks show that IGTR outperforms existing models by significant margins while maintaining a small model size and fast inference speed. Moreover, by adjusting the sampling of instructions, IGTR offers an elegant way to tackle the recognition of both rarely appearing and morphologically similar characters, both of which have been challenging for previous methods. Code is available at \href{https://github.com/Topdu/OpenOCR}{https://github.com/Topdu/OpenOCR}.
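To make the triplet idea concrete, the following is a minimal sketch of how condition, question, and answer triplets about character frequency and position might be derived from a ground-truth label. It is an illustration under stated assumptions, not the authors' implementation: the `Instruction` type, the `build_triplets` function, and the wording of the conditions and questions are all hypothetical, and the actual method presumably encodes instructions in a learned form rather than as English strings.

```python
from collections import Counter
from typing import List, NamedTuple


class Instruction(NamedTuple):
    """Hypothetical container for one <condition, question, answer> triplet."""
    condition: str
    question: str
    answer: str


def build_triplets(label: str) -> List[Instruction]:
    """Build illustrative frequency and position triplets for one text label."""
    condition = "the text in the image"  # assumed placeholder condition
    triplets: List[Instruction] = []

    # Frequency attribute: how often each distinct character occurs.
    for char, count in sorted(Counter(label).items()):
        triplets.append(Instruction(
            condition=condition,
            question=f"how many times does '{char}' appear?",
            answer=str(count),
        ))

    # Position attribute: which character sits at each 0-based index.
    for index, char in enumerate(label):
        triplets.append(Instruction(
            condition=condition,
            question=f"which character is at position {index}?",
            answer=char,
        ))

    return triplets


if __name__ == "__main__":
    for triplet in build_triplets("hello"):
        print(triplet)
```

In this sketch, changing which question types are generated, or how often a given character is asked about, is the kind of knob that sampling instructions differently would correspond to.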

