An information-matching approach to optimal experimental design and active learning

Yonatan Kurniawan, Tracianne B. Neilsen, Benjamin L. Francis, Alex M. Stankovic, Mingjian Wen, Ilia Nikiforov, Ellad B. Tadmor, Vasily V. Bulatov, Vincenzo Lordi, Mark K. Transtrum

The efficacy of mathematical models depends heavily on the quality of the training data, yet collecting sufficient data is often expensive and challenging. Many modeling applications require inferring parameters only as a means to predict other quantities of interest (QoIs). Because models often contain many unidentifiable (sloppy) parameters, QoIs often depend on a relatively small number of parameter combinations. Therefore, we introduce an information-matching criterion based on the Fisher Information Matrix to select the most informative training data from a candidate pool. This method ensures that the selected data contain sufficient information to learn only those parameters that are needed to constrain downstream QoIs. It is formulated as a convex optimization problem, making it scalable to large models and datasets. We demonstrate the effectiveness of this approach across various modeling problems in diverse scientific fields, including power systems and underwater acoustics. Finally, we use information-matching as a query function within an active learning loop for materials science applications. In all these applications, we find that a relatively small set of optimal training data can provide the necessary information for achieving precise predictions. These results are encouraging for diverse future applications, particularly active learning in large machine learning models.
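To make the matching idea concrete, here is a minimal sketch of one plausible reading of the criterion: a set of nonnegative weights on candidate data is acceptable when the weighted sum of the candidates' Fisher Information Matrices exceeds the information the QoIs require, in the positive semidefinite sense. The abstract does not spell out the formulation, so the Gaussian-noise FIM, the weight interpretation, the two-parameter toy model, and every function and variable name below are assumptions made for illustration. The sketch only checks feasibility of a given weight vector; it does not solve the convex optimization problem itself.

```python
# Illustrative sketch only: a feasibility check for an assumed
# information-matching condition on a toy 2-parameter model.
# All names here are hypothetical, not taken from the paper.

def fim_from_jacobian(jac, sigma=1.0):
    """Fisher Information Matrix J^T J / sigma^2, assuming Gaussian noise."""
    n = len(jac[0])
    return [
        [sum(row[i] * row[j] for row in jac) / sigma**2 for j in range(n)]
        for i in range(n)
    ]

def weighted_sum(weights, fims):
    """Elementwise sum of w_m * I_m over all candidate FIMs."""
    n = len(fims[0])
    return [
        [sum(w * f[i][j] for w, f in zip(weights, fims)) for j in range(n)]
        for i in range(n)
    ]

def is_psd_2x2(m, tol=1e-12):
    """A symmetric 2x2 matrix is PSD iff its trace and determinant are >= 0."""
    trace = m[0][0] + m[1][1]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    return trace >= -tol and det >= -tol

def satisfies_matching(weights, candidate_fims, target_fim):
    """True if all weights are >= 0 and sum_m w_m I_m - target is PSD."""
    total = weighted_sum(weights, candidate_fims)
    diff = [[total[i][j] - target_fim[i][j] for j in range(2)] for i in range(2)]
    return all(w >= 0 for w in weights) and is_psd_2x2(diff)

# Toy model with parameters (a, b). Each candidate datum is described by the
# Jacobian of its prediction with respect to (a, b).
candidate_fims = [
    fim_from_jacobian([[1.0, 0.0]]),  # datum sensitive to a only
    fim_from_jacobian([[0.0, 1.0]]),  # datum sensitive to b only
    fim_from_jacobian([[1.0, 1.0]]),  # datum sensitive to a + b only
]

# The QoI depends only on a and should be known to within sigma = 0.5.
target_fim = fim_from_jacobian([[1.0, 0.0]], sigma=0.5)

print(satisfies_matching([4.0, 0.0, 0.0], candidate_fims, target_fim))  # True
print(satisfies_matching([0.0, 0.0, 4.0], candidate_fims, target_fim))  # False
```

In this toy case the third datum fails on its own even with a large weight, because it constrains only the combination a + b rather than a itself. A full treatment would search for the sparsest feasible weight vector with a semidefinite programming solver instead of testing hand-picked weights.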


