Scientific machine learning is limited less by model size than by the data it is trained on. Observational data records what happened but not why; template synthetic data has a known generating process, but only for the simulator's template, not for the specific case a user faces. We argue that a third option is now operationally feasible: instrumented data, in which every datum carries the mechanistic model that produced it, an explicit uncertainty over that model, and an executable family of counterfactuals. Verification-and-validation (V&V) instrumented image-to-simulation pipelines are one realisation: a sensor observation becomes a fully specified, solver-backed simulation with explicit, editable parameters and propagated aleatoric and epistemic uncertainty. The resulting substrate is case-specific, mechanistically supervised, and supports causal interventions through Pearl's do-operator. Near-term consequences for validation, auditing, and surrogate training span computational biology, climate, materials, fluid mechanics, and medical imaging; a longer-term, falsifiable implication concerns foundation models for scientific reasoning.