This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks, including scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually rich documents. The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin content, but struggles with multilingual scenarios and complex tasks. Specifically, it shows limitations when dealing with non-Latin languages and with complex tasks such as handwritten mathematical expression recognition, table structure recognition, and end-to-end semantic entity recognition and pair extraction from document images. Based on these observations, we affirm the necessity and continued research value of specialized OCR models. In general, despite its versatility in handling diverse OCR tasks, GPT-4V does not outperform existing state-of-the-art OCR models. How to fully utilize pre-trained general-purpose LMMs such as GPT-4V for downstream OCR tasks remains an open problem. The study offers a critical reference for future research in OCR with LMMs. The evaluation pipeline and results are available at https://github.com/SCUT-DLVCLab/GPT-4V_OCR.