Large models have recently played a dominant role in natural language processing and multimodal vision-language learning, but their efficacy in text-related visual tasks remains less explored. We conducted a comprehensive study of existing publicly available multimodal models, evaluating their performance in text recognition (document text, artistic text, handwritten text, scene text), text-based visual question answering (document text, scene text, and bilingual text), key information extraction (receipts, documents, and nutrition facts), and handwritten mathematical expression recognition. Our findings reveal strengths and weaknesses in these models, which primarily rely on semantic understanding for word recognition and exhibit inferior perception of individual character shapes. They are also insensitive to text length and have limited capabilities in detecting fine-grained features in images. Consequently, these results demonstrate that even the most powerful current large multimodal models cannot match domain-specific methods in traditional text tasks and face greater challenges in more complex tasks. Most importantly, the baseline results presented in this study could provide a foundation for conceiving and assessing innovative strategies targeted at enhancing zero-shot multimodal techniques. The evaluation pipeline is available at https://github.com/Yuliang-Liu/MultimodalOCR.