The efficacy of mathematical models depends heavily on the quality of the training data, yet collecting sufficient data is often expensive and difficult. In many modeling applications, parameters are inferred only as a means to predict other quantities of interest (QoIs). Because models often contain many unidentifiable (sloppy) parameters, QoIs typically depend on a relatively small number of parameter combinations. We therefore introduce an information-matching criterion, based on the Fisher Information Matrix, for selecting the most informative training data from a candidate pool. The method ensures that the selected data contain enough information to learn only those parameters needed to constrain the downstream QoIs. It is formulated as a convex optimization problem, which makes it scalable to large models and datasets. We demonstrate the approach on modeling problems from diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an active learning loop for materials science applications. In all of these applications, a relatively small set of optimal training data provides the information necessary for precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.
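To make the information-matching idea concrete, here is a minimal sketch on a toy problem. It is not the authors' implementation, and the abstract does not state the formulation; the sketch assumes that "matching" means the weighted sum of candidate Fisher Information Matrices (FIMs) minus the FIM required by the QoI is positive semidefinite. The linear model y(x) = theta0 + theta1 * x, the Gaussian noise levels, the target precision, and all function names are hypothetical choices made for illustration. The code only checks the condition for given weights; it does not solve the convex optimization problem.

```python
# Sketch only: toy two-parameter model, hypothetical names, assumed criterion.
# Assumed matching condition: sum_m w_m * I_m - I_qoi is positive semidefinite.

def fim_from_jacobian_row(row, sigma):
    """FIM of one scalar observation with Gaussian noise: (j j^T) / sigma^2."""
    return [[a * b / sigma**2 for b in row] for a in row]

def weighted_sum(weights, matrices):
    """Return sum_m w_m * I_m for a list of 2x2 matrices."""
    total = [[0.0, 0.0], [0.0, 0.0]]
    for w, mat in zip(weights, matrices):
        for i in range(2):
            for j in range(2):
                total[i][j] += w * mat[i][j]
    return total

def is_psd_2x2(mat, tol=1e-9):
    """A symmetric 2x2 matrix is PSD iff its diagonal and determinant are >= 0."""
    det = mat[0][0] * mat[1][1] - mat[0][1] * mat[1][0]
    return mat[0][0] >= -tol and mat[1][1] >= -tol and det >= -tol

def information_matched(weights, candidate_fims, qoi_fim):
    """Check whether the weighted candidate information covers the QoI target."""
    diff = weighted_sum(list(weights) + [-1.0], candidate_fims + [qoi_fim])
    return is_psd_2x2(diff)

# Toy model y(x) = theta0 + theta1 * x, so the Jacobian row at x is [1, x].
candidate_x = [0.0, 1.0, 2.0]          # hypothetical candidate pool
candidate_fims = [fim_from_jacobian_row([1.0, x], sigma=1.0) for x in candidate_x]

# QoI: the prediction at x = 1 with an assumed target standard deviation of 0.5.
qoi_fim = fim_from_jacobian_row([1.0, 1.0], sigma=0.5)

print(information_matched([0.0, 4.0, 0.0], candidate_fims, qoi_fim))  # True
print(information_matched([4.0, 0.0, 0.0], candidate_fims, qoi_fim))  # False
print(information_matched([2.0, 0.0, 2.0], candidate_fims, qoi_fim))  # True
```

In this toy case, weighting only the candidate at x = 0 fails because it carries no information about the slope, while weighting the candidate at x = 1 alone is enough. A convex solver would search over the weights for a sparse vector satisfying the same condition, which is where a small set of selected data would come from.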