Industrial projects rely heavily on long, complex specification documents, and manually extracting structured information from them is a major bottleneck. This paper introduces an approach to automate this process using two AI models: Donut, which extracts information directly from scanned documents without OCR, and OpenAI GPT-3.5 Turbo, a large language model. The proposed method first extracts the tables of contents (ToCs) from construction specification documents and then structures the ToC text as JSON data. In organizing the ToCs, Donut reaches 85% accuracy and GPT-3.5 Turbo reaches 89%. These results mark progress in document indexing and suggest that AI can automate information extraction across diverse document types, improving efficiency and freeing resources in a range of industries.
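To make the target of the ToC-structuring step concrete, the sketch below turns plain ToC text into nested JSON. It is only an illustration of the kind of output involved: the rule-based parser is a stand-in for the Donut and GPT-3.5 Turbo models, and the line pattern, the field names (`number`, `title`, `page`, `children`), and the function name `toc_to_json` are assumptions, not details taken from the paper.

```python
import json
import re

# Assumed ToC line layout: "<section number> <title> [dot leaders] <page>".
# Real specification documents may use other layouts.
TOC_LINE = re.compile(
    r"^\s*(?P<number>\d+(?:\.\d+)*)\s+(?P<title>.+?)\s*\.*\s*(?P<page>\d+)\s*$"
)


def toc_to_json(toc_text):
    """Parse ToC text into a nested list of section dictionaries."""
    root = []
    stack = []  # (depth, node) pairs for the currently open sections
    for line in toc_text.splitlines():
        match = TOC_LINE.match(line)
        if match is None:
            continue  # skip lines that do not look like ToC entries
        node = {
            "number": match["number"],
            "title": match["title"],
            "page": int(match["page"]),
            "children": [],
        }
        # Depth is inferred from the numbering: "1" -> 1, "1.1" -> 2, ...
        depth = match["number"].count(".") + 1
        # Close sections that are at the same depth or deeper.
        while stack and stack[-1][0] >= depth:
            stack.pop()
        parent = stack[-1][1]["children"] if stack else root
        parent.append(node)
        stack.append((depth, node))
    return root


if __name__ == "__main__":
    sample = "1 General 1\n1.1 Scope ..... 3\n2 Products 7"
    print(json.dumps(toc_to_json(sample), indent=2))
```

A learned model would replace the regular expression in a real pipeline, but the nested JSON it is asked to produce could take a similar shape.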