This paper presents a comprehensive evaluation of the Optical Character Recognition (OCR) capabilities of the recently released GPT-4V(ision), a Large Multimodal Model (LMM). We assess the model's performance across a range of OCR tasks, including scene text recognition, handwritten text recognition, handwritten mathematical expression recognition, table structure recognition, and information extraction from visually-rich document. The evaluation reveals that GPT-4V performs well in recognizing and understanding Latin contents, but struggles with multilingual scenarios and complex tasks. Based on these observations, we delve deeper into the necessity of specialized OCR models and deliberate on the strategies to fully harness the pretrained general LMMs like GPT-4V for OCR downstream tasks. The study offers a critical reference for future research in OCR with LMMs. Evaluation pipeline and results are available at https://github.com/SCUT-DLVCLab/GPT-4V_OCR.
翻译:本文对最新发布的大型多模态模型(LMM)GPT-4V(ision)的光学字符识别(OCR)能力进行了全面评估。我们从场景文字识别、手写文字识别、手写数学表达式识别、表格结构识别以及从丰富视觉文档中提取信息等多个OCR任务维度,评估了该模型的性能。评估结果显示,GPT-4V在识别和理解拉丁语系内容方面表现良好,但在处理多语言场景和复杂任务时存在困难。基于这些观察,我们深入探讨了专用OCR模型的必要性,并论证了如何充分利用GPT-4V等预训练通用LMM进行OCR下游任务的策略。本研究为未来基于LMM的OCR研究提供了重要参考。评估流程与结果可在 https://github.com/SCUT-DLVCLab/GPT-4V_OCR 获取。