A Multi-Stage Hybrid Framework for Automated Interpretation of Multi-View Engineering Drawings Using Vision Language Model

Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging using manual methods, generic optical character recognition (OCR) systems, or traditional deep learning approaches, due to varied layouts, orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including measures, GD&T symbols, and surface roughness indicators. The third stage employs two Donut-based, OCR-free VLMs for semantic content parsing: the Alphabetical VLM extracts textual and categorical information from title blocks and notes, while the Numerical VLM interprets quantitative data such as measures, GD&T frames, and surface roughness. Two specialized datasets were developed to ensure robustness and generalization: 1,000 drawings for layout detection and 1,406 for annotation-level training. The Alphabetical VLM achieved an overall F1 score of 0.672, while the Numerical VLM reached 0.963, demonstrating strong performance in textual and quantitative interpretation, respectively. The unified JSON output enables seamless integration with CAD and manufacturing databases, providing a scalable solution for intelligent engineering drawing analysis.
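The three-stage flow described above can be sketched as a small orchestration function. This is a minimal illustration under stated assumptions, not the authors' implementation. The abstract does not give the JSON schema, the class names, or the model interfaces, so every name here (`Region`, `Annotation`, `interpret_drawing`, the label strings, and the stub detectors and parsers) is hypothetical. The detectors and parsers are passed in as plain callables, so real YOLO or Donut wrappers could be substituted for the stubs.

```python
import json
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Region:
    """Stage 1 output: an axis-aligned layout region (hypothetical schema)."""
    label: str                               # assumed labels: "view", "title_block", "notes"
    box: Tuple[float, float, float, float]   # x1, y1, x2, y2 in pixels


@dataclass
class Annotation:
    """Stage 2 output: an oriented annotation box (hypothetical schema)."""
    label: str                               # assumed labels: "measure", "gdt", "surface_roughness"
    box: Tuple[float, float, float, float]   # cx, cy, width, height in pixels
    angle: float                             # rotation in degrees


def interpret_drawing(
    image_path: str,
    detect_layout: Callable[[str], List[Region]],
    detect_annotations: Callable[[str, Region], List[Annotation]],
    parse_text: Callable[[str, Region], Dict],
    parse_number: Callable[[str, Annotation], Dict],
) -> str:
    """Run the three stages in order and merge the results into one JSON string."""
    result: Dict[str, list] = {"title_block": [], "notes": [], "views": []}
    for region in detect_layout(image_path):                  # stage 1: layout
        if region.label == "view":
            parsed = [
                {
                    "type": ann.label,
                    "angle": ann.angle,
                    "content": parse_number(image_path, ann),  # stage 3: numerical parser
                }
                for ann in detect_annotations(image_path, region)  # stage 2: oriented boxes
            ]
            result["views"].append({"box": list(region.box), "annotations": parsed})
        elif region.label in ("title_block", "notes"):
            result[region.label].append(parse_text(image_path, region))  # stage 3: text parser
        # Regions with any other label are ignored in this sketch.
    return json.dumps(result, indent=2)


# Stub models standing in for the real detectors and VLMs (values are made up).
def stub_layout(image_path: str) -> List[Region]:
    return [
        Region("title_block", (0.0, 900.0, 400.0, 1000.0)),
        Region("view", (50.0, 50.0, 600.0, 800.0)),
    ]


def stub_annotations(image_path: str, region: Region) -> List[Annotation]:
    return [Annotation("measure", (120.0, 200.0, 60.0, 18.0), 90.0)]


def stub_text(image_path: str, region: Region) -> Dict:
    return {"part_name": "BRACKET", "material": "AL 6061"}


def stub_number(image_path: str, annotation: Annotation) -> Dict:
    return {"value": 12.5, "unit": "mm"}


output = interpret_drawing(
    "drawing.png", stub_layout, stub_annotations, stub_text, stub_number
)
print(output)
```

Keeping the four models behind callables mirrors the separation in the abstract: layout regions decide which parser runs, and only regions labeled as views are passed to the oriented annotation detector.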
