Optical Character Recognition (OCR) technology is widely used to extract text from images of documents, facilitating efficient digitization and data retrieval. However, merely extracting text is insufficient when dealing with complex documents. Fully comprehending such documents requires an understanding of their structure -- including formatting, formulas, tables, and the reading order of multiple blocks and columns across multiple pages -- as well as semantic information for detecting elements such as footnotes and image captions. This comprehensive understanding is crucial for downstream tasks such as retrieval, document question answering, and data curation for training Large Language Models (LLMs) and Vision Language Models (VLMs). To address this, we introduce Éclair, a general-purpose text-extraction tool specifically designed to process a wide range of document types. Given an image, Éclair can extract formatted text in reading order, along with bounding boxes and their corresponding semantic classes. To thoroughly evaluate these novel capabilities, we introduce a diverse, human-annotated benchmark for document-level OCR and semantic classification. Éclair achieves state-of-the-art accuracy on this benchmark, outperforming other methods across key metrics. Additionally, we evaluate Éclair on established benchmarks, demonstrating its versatility and strength across several evaluation standards.