EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

Large Language Model (LLM) agents are increasingly applied to engineering design tasks, yet existing evaluation frameworks do not adequately address multi-agent systems that combine simulation, retrieval, and manufacturing preparation. We introduce a benchmark suite with three evaluation dimensions: (1) a workflow benchmark with seven prompt styles targeting distinct cognitive demands, including direct tool use, semantic disambiguation, conditional branching, and working-memory tasks; (2) a Retrieval-Augmented Generation (RAG) benchmark with gated scoring that isolates the contribution of retrieval to parameter selection; and (3) a High-Performance Computing (HPC) benchmark evaluating end-to-end machine learning (ML) training orchestration on a SLURM cluster. Alongside the benchmark, we present EngiAI, a Multi-Agent System (MAS) reference implementation built on LangGraph that operationalizes the benchmark by coordinating seven specialized agents through a supervisor architecture, unifying topology optimization, document retrieval, HPC job orchestration, and 3D printer control. Across four LLM backends and two EngiBench problems, proprietary models achieve 96-97% average task completion on Beams2D, while open-source 4B-parameter models reach 55-78%, with clear generational improvement. Conditional branching proves most challenging: task completion for the conditional style drops to 20-53% on Photonics2D. The gated RAG scoring yields near-perfect scores (about 1.0) with retrieval versus near-zero scores without it, validating the evaluation design. On HPC orchestration, one model completes all pipeline steps in 100% of runs while another drops to 50%, revealing that multi-step instruction following degrades over long-running workflows.
