An Industrial-Scale Retrieval-Augmented Generation Framework for Requirements Engineering: Empirical Evaluation with Automotive Manufacturing Data

Requirements engineering (RE) in Industry 4.0 faces critical challenges from heterogeneous, unstructured documentation spanning technical specifications, supplier lists, and compliance standards. While retrieval-augmented generation (RAG) shows promise for knowledge-intensive tasks, no prior work has evaluated RAG on authentic industrial RE workflows using production-grade performance metrics. This paper presents a comprehensive empirical evaluation of RAG for industrial RE automation using authentic automotive manufacturing documentation: 669 requirements across four specification standards (MBN 9666-1, MBN 9666-2, BQF 9666-5, MBN 9666-9) spanning 2015-2023, plus 49 supplier qualifications with extensive supporting documentation. In controlled comparisons with BERT-based and ungrounded LLM approaches, the framework achieves 98.2% extraction accuracy with complete traceability, outperforming these baselines by 24.4% and 19.6%, respectively. Hybrid semantic-lexical retrieval achieves a mean reciprocal rank (MRR) of 0.847. Expert quality assessment averaged 4.32/5.0 across five dimensions. The evaluation demonstrates an 83% reduction in manual analysis time and 47% cost savings through multi-provider LLM orchestration. Ablation studies quantify individual component contributions. Longitudinal analysis reveals a 55% reduction in requirement volume coupled with a 1,800% increase in IT security focus, and identifies 10 legacy suppliers (20.4%) requiring requalification, representing a potential $2.3M in avoided contract penalties.
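
For readers unfamiliar with the retrieval metric reported above, the following is a minimal sketch of how mean reciprocal rank is conventionally computed: for each query, take the reciprocal of the rank of the first relevant result (0 if none is retrieved), then average over all queries. The function name, document identifiers, and example queries are hypothetical illustrations and are not taken from the paper's evaluation code or data.

```python
from typing import Hashable, Sequence, Set


def mean_reciprocal_rank(
    rankings: Sequence[Sequence[Hashable]],
    relevant: Sequence[Set[Hashable]],
) -> float:
    """Average of 1/rank of the first relevant item per query.

    rankings[i] is the ordered result list for query i (best first);
    relevant[i] is the set of items judged relevant for query i.
    A query with no relevant item in its ranking contributes 0.
    """
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        return 0.0

    total = 0.0
    for ranking, rel in zip(rankings, relevant):
        for rank, item in enumerate(ranking, start=1):
            if item in rel:
                total += 1.0 / rank
                break  # only the first relevant hit counts
    return total / len(rankings)


# Hypothetical example: three queries over made-up requirement IDs.
example_rankings = [
    ["REQ-1", "REQ-7", "REQ-3"],  # first relevant hit at rank 1
    ["REQ-4", "REQ-2", "REQ-9"],  # first relevant hit at rank 2
    ["REQ-5", "REQ-6", "REQ-8"],  # no relevant hit
]
example_relevant = [{"REQ-1"}, {"REQ-2"}, {"REQ-0"}]

mrr = mean_reciprocal_rank(example_rankings, example_relevant)
print(f"MRR = {mrr:.3f}")
```

Under this definition, a value close to 1 indicates that the first relevant document usually appears at or near the top of the ranking.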
