Interpretable Multi-Source Data Fusion Through Latent Variable Gaussian Process

Sandipp Krishnan Ravi, Yigitcan Comlek, Wei Chen, Arjun Pathak, Vipul Gupta, Rajnikant Umretiya, Andrew Hoffman, Ghanshyam Pilania, Piyush Pandita, Sayan Ghosh, Nathaniel Mckeever, Liping Wang

From arXiv: 27 pages, 10 figures, 3 supplementary figures, 2 supplementary tables

With the advent of artificial intelligence (AI) and machine learning (ML), various domains of science and engineering have leveraged data-driven surrogates to model complex systems from numerous sources of information (data). This proliferation has led to a significant reduction in the cost and time involved in developing superior systems designed to perform specific functions. A high proportion of such surrogates are built by fusing multiple sources of data, be it published papers, patents, open repositories, or other resources. However, little attention has been paid to differences in the quality and comprehensiveness of the known and unknown underlying physical parameters of the information sources, which could have downstream implications during system optimization. To resolve this issue, a multi-source data fusion framework based on the Latent Variable Gaussian Process (LVGP) is proposed. The individual data sources are tagged as a characteristic categorical variable that is mapped into a physically interpretable latent space, allowing the development of source-aware data fusion modeling. Additionally, a dissimilarity metric based on the latent variables of LVGP is introduced to study and understand the differences among the sources of data. The proposed approach is demonstrated on and analyzed through two mathematical case studies (a representative parabola problem and the 2D Ackley function) and two materials science case studies (design of FeCrAl and SmCoFe alloys). The case studies show that, compared to single-source and source-unaware ML models, the proposed multi-source data fusion framework can provide better predictions for sparse-data problems, interpretability regarding the sources, and enhanced modeling capabilities by taking advantage of the correlations and relationships among different sources.
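The source-aware idea can be illustrated with a minimal Python sketch. It is not the authors' implementation, and everything in it is an assumption made for illustration: the function names `lvgp_correlation` and `source_dissimilarity` are hypothetical, the latent coordinates are hand-picked rather than estimated from data, and the dissimilarity is taken to be the Euclidean distance between latent positions, a form the abstract does not specify. The correlation follows the usual LVGP pattern of a squared-exponential kernel over the quantitative inputs combined with the squared distance between the latent vectors of the two source labels.

```python
import numpy as np


def lvgp_correlation(x1, s1, x2, s2, latent, lengthscale=1.0):
    """Squared-exponential correlation over quantitative inputs x and
    the latent vectors z(s) assigned to the categorical source labels s."""
    d_x = np.sum(((np.asarray(x1) - np.asarray(x2)) / lengthscale) ** 2)
    d_z = np.sum((latent[s1] - latent[s2]) ** 2)
    return float(np.exp(-(d_x + d_z)))


def source_dissimilarity(latent):
    """Pairwise Euclidean distance between source latent vectors
    (an assumed form of the dissimilarity metric)."""
    diff = latent[:, None, :] - latent[None, :, :]
    return np.sqrt(np.sum(diff ** 2, axis=-1))


# Hypothetical 2D latent positions for three data sources. In an actual
# LVGP these would be estimated jointly with the other hyperparameters;
# here they are fixed by hand, with source 0 placed at the origin.
latent = np.array([
    [0.0, 0.0],   # source 0 (reference)
    [0.1, 0.0],   # source 1: close to source 0
    [1.5, 0.4],   # source 2: far from both
])

D = source_dissimilarity(latent)
x = np.array([0.5])

# Same input, different source labels: nearby sources stay highly
# correlated, distant sources contribute little to each other.
r_01 = lvgp_correlation(x, 0, x, 1, latent)
r_02 = lvgp_correlation(x, 0, x, 2, latent)
print(D.round(3))
print(r_01 > r_02)
```

In this sketch, sources that sit close together in the latent space share information almost as if they were one dataset, while a distant source is effectively down-weighted. That is the sense in which latent distances can be read as a measure of how different the sources are.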
