Domain-Validity-Gated Metamorphic Testing of Scientific ML Surrogates

Scientific machine-learning (SciML) surrogates approximate expensive simulations, but exact expected outputs for arbitrary inputs are unavailable (the oracle problem). Metamorphic testing checks relations across executions, yet a candidate relation is not automatically valid: its preconditions, output mapping, and the numerical floor of the scoring operator determine whether a violation is meaningful. We study how candidate metamorphic relations (MRs) can be screened for domain validity and turned into executable, oracle-free test assets for SciML surrogates. We propose (i) a domain-validity rubric that admits a candidate only when its tolerance dominates the operator's numerical floor and its preconditions hold; (ii) an MR-card executable-asset format recording source cases, transformations, metrics, tolerances, and typed relation-level verdicts; and (iii) a case-study protocol on MeshGraphNets cylinder-flow surrogates, with a claim ledger binding every result to a tracked artifact. On a MeshGraphNets checkpoint, node permutation holds to machine precision, mirror-y is a bounded out-of-distribution stress finding rather than an exact symmetry, and absolute conservation stays deferred while a reference-relative guard passes. The same readings hold across held-out trajectories, a checkpoint roster, three further architectures, and PhysicsNeMo. On a second CFD task (compressible airfoil) the predicate instead rejects incompressible continuity on physical grounds, showing it reasons about domain validity rather than running a fixed checklist. On a second PDE family, FNO Burgers and heat surrogates run full admit/reject/execute verdicts. The evidence spans two CFD tasks and a second PDE family, supporting a validity-aware bridge from candidate MRs to auditable SciML test assets that separates model-level violations from out-of-domain applications.
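The validity gate and the node-permutation relation described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the names `MRCard`, `Verdict`, `safety_factor`, `toy_surrogate`, and `run_permutation_mr` are hypothetical, the "dominates" condition is assumed to mean that the tolerance exceeds the numerical floor by a fixed margin, and a toy permutation-equivariant function stands in for a MeshGraphNets surrogate.

```python
from dataclasses import dataclass
from enum import Enum
import random


class Verdict(Enum):
    # Hypothetical verdict types; the source only says verdicts are typed.
    REJECTED = "rejected"   # candidate failed the validity gate, never executed
    HOLDS = "holds"         # relation executed and stayed within tolerance
    VIOLATED = "violated"   # relation executed and exceeded tolerance


@dataclass
class MRCard:
    """Hypothetical, minimal MR card: relation name, tolerance, and gate inputs."""
    name: str
    tolerance: float
    numerical_floor: float       # error the scoring operator shows on exact data
    preconditions_hold: bool     # e.g. incompressibility for a continuity MR
    safety_factor: float = 10.0  # assumed margin for "tolerance dominates floor"

    def admitted(self) -> bool:
        # Admit only if preconditions hold AND tolerance dominates the floor.
        dominates = self.tolerance >= self.safety_factor * self.numerical_floor
        return self.preconditions_hold and dominates


def toy_surrogate(nodes):
    """Stand-in model: each node's output uses its own value and the global mean,
    so permuting the inputs should permute the outputs the same way."""
    mean = sum(nodes) / len(nodes)
    return [2.0 * x + mean for x in nodes]


def run_permutation_mr(card, nodes, seed=0):
    """Gate the card, then compare source and follow-up outputs without an oracle."""
    if not card.admitted():
        return Verdict.REJECTED
    perm = list(range(len(nodes)))
    random.Random(seed).shuffle(perm)
    source_out = toy_surrogate(nodes)
    followup_out = toy_surrogate([nodes[i] for i in perm])
    # Output mapping: follow-up output k must match source output perm[k].
    err = max(abs(followup_out[k] - source_out[i]) for k, i in enumerate(perm))
    return Verdict.HOLDS if err <= card.tolerance else Verdict.VIOLATED


nodes = [0.1, 0.5, -0.3, 1.2]
permutation = MRCard("node_permutation", 1e-9, 1e-15, preconditions_hold=True)
continuity = MRCard("incompressible_continuity", 1e-9, 1e-15, preconditions_hold=False)
print(run_permutation_mr(permutation, nodes))
print(run_permutation_mr(continuity, nodes))
```

In this sketch a candidate whose preconditions fail, or whose tolerance sits at or below the operator's floor, is rejected before execution, so a later violation can be attributed to the model rather than to an out-of-domain application of the relation.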
