Engineering drawings are fundamental to manufacturing communication, serving as the primary medium for conveying design intent, tolerances, and production details. However, interpreting complex multi-view drawings with dense annotations remains challenging for manual methods, generic optical character recognition (OCR) systems, and traditional deep learning approaches because of their varied layouts, arbitrary orientations, and mixed symbolic-textual content. To address these challenges, this paper proposes a three-stage hybrid framework for the automated interpretation of 2D multi-view engineering drawings using modern detection and vision-language models (VLMs). In the first stage, YOLOv11-det performs layout segmentation to localize key regions such as views, title blocks, and notes. The second stage uses YOLOv11-obb for orientation-aware, fine-grained detection of annotations, including dimensions, GD&T symbols, and surface roughness indicators. The third stage employs two Donut-based, OCR-free VLMs for semantic content parsing: the Alphabetical VLM extracts textual and categorical information from title blocks and notes, while the Numerical VLM interprets quantitative data such as dimensions, GD&T frames, and surface roughness values. Two specialized datasets were developed to ensure robustness and generalization: 1,000 drawings for layout detection and 1,406 for annotation-level training. The Alphabetical VLM achieved an overall F1 score of 0.672 and the Numerical VLM reached 0.963, demonstrating strong performance in textual and quantitative interpretation, respectively. The unified JSON output enables seamless integration with CAD and manufacturing databases, providing a scalable solution for intelligent engineering drawing analysis.
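The three-stage flow described above can be sketched as a pipeline that merges the detectors' and VLMs' outputs into one JSON document. The stage functions below are illustrative stubs, and every field name and value is an assumption for demonstration, not the paper's actual schema; real implementations would wrap YOLOv11-det, YOLOv11-obb, and the two Donut-based VLMs.

```python
import json

def stage1_layout(drawing):
    """Stage 1 (YOLOv11-det, stubbed): localize coarse regions.

    Returns region name -> bounding box (x1, y1, x2, y2)."""
    return {"title_block": (0, 0, 200, 80), "notes": (0, 80, 200, 120)}

def stage2_annotations(drawing, regions):
    """Stage 2 (YOLOv11-obb, stubbed): orientation-aware detection of
    dimensions, GD&T symbols, and surface roughness indicators."""
    return [{"type": "dimension", "bbox": (10, 10, 40, 20), "angle": 0.0}]

def stage3_parse(regions, annotations):
    """Stage 3 (Donut-based VLMs, stubbed): the Alphabetical VLM parses
    textual fields, the Numerical VLM parses quantitative annotations."""
    alphabetical = {"material": "AISI 304", "drawing_no": "A-102"}  # illustrative values
    numerical = [{"type": "dimension", "value": 25.0, "tolerance": "+/-0.1"}]
    return alphabetical, numerical

def interpret(drawing):
    """Run the three stages and emit the unified JSON output."""
    regions = stage1_layout(drawing)
    annotations = stage2_annotations(drawing, regions)
    alphabetical, numerical = stage3_parse(regions, annotations)
    # Unified output for downstream CAD / manufacturing-database integration.
    return json.dumps({"title_block": alphabetical, "annotations": numerical})

result = json.loads(interpret("drawing.png"))
```

The unified dictionary keeps the Alphabetical VLM's text fields and the Numerical VLM's quantitative records side by side, which is what makes a single JSON document sufficient for database ingestion.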