Large Language Models (LLMs) are increasingly entering specialized, safety-critical engineering workflows governed by strict quantitative standards and immutable physical laws, making rigorous evaluation of their reasoning capabilities imperative. However, existing benchmarks such as MMLU, MATH, and HumanEval assess isolated cognitive skills and fail to capture the physically grounded reasoning central to engineering, where scientific principles, quantitative modeling, and practical constraints must converge. To enable verifiable process supervision in engineering, we introduce EngTrace, a symbolic benchmark built on 90 parameterized templates, each generating unique, contamination-resistant problem instances. The templates span three major engineering branches, nine core domains, and 20 distinct areas, yielding 1,350 test cases that stress-test generalization across diverse physical scenarios. Moving beyond outcome matching, we propose a verifiable two-stage evaluation framework whose tiered protocol validates intermediate reasoning traces alongside final answers through automated procedural checks and a heterogeneous AI Tribunal. Our evaluation of 27 leading LLMs reveals a distinct trade-off between numeric precision and trace fidelity and identifies a complexity cliff where abstract mathematical pre-training fails to translate into the integrative reasoning required for advanced engineering tasks.