Interpretable Multi-Source Data Fusion Through Latent Variable Gaussian Process

Sandipp Krishnan Ravi, Yigitcan Comlek, Wei Chen, Arjun Pathak, Vipul Gupta, Rajnikant Umretiya, Andrew Hoffman, Ghanshyam Pilania, Piyush Pandita, Sayan Ghosh, Nathaniel Mckeever, Liping Wang

From arXiv; 27 pages, 9 figures, 3 supplementary figures, 2 supplementary tables

With the advent of artificial intelligence (AI) and machine learning (ML), various domains of science and engineering have leveraged data-driven surrogates to model complex systems from numerous sources of information (data). This proliferation has led to a significant reduction in the cost and time involved in developing superior systems designed to perform specific functions. A high proportion of such surrogates are built by fusing multiple sources of data, be it published papers, patents, open repositories, or other resources. However, little attention has been paid to how the information sources differ in the quality and comprehensiveness of their known and unknown underlying physical parameters, differences that could have downstream implications during system optimization. To resolve this issue, a multi-source data fusion framework based on the Latent Variable Gaussian Process (LVGP) is proposed. The individual data sources are tagged as a characteristic categorical variable that is mapped into a physically interpretable latent space, allowing the development of source-aware data fusion modeling. Additionally, a dissimilarity metric based on the latent variables of LVGP is introduced to study and understand the differences among the sources of data. The proposed approach is demonstrated on and analyzed through two mathematical case studies (a representative parabola problem and the 2D Ackley function) and two materials science case studies (design of FeCrAl and SmCoFe alloys). The case studies show that, compared to single-source and source-unaware ML models, the proposed multi-source data fusion framework can provide better predictions for sparse-data problems, interpretability regarding the sources, and enhanced modeling capabilities by taking advantage of the correlations and relationships among different sources.
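The source-to-latent mapping and the latent-space dissimilarity described in the abstract can be illustrated with a minimal sketch. Everything specific here is an assumption for illustration and is not taken from the paper: the 2D latent dimension, the hand-picked latent positions (in LVGP these would be estimated together with the kernel hyperparameters), the Gaussian correlation form, the Euclidean distance used as the dissimilarity, and the function names `lvgp_correlation` and `source_dissimilarity`.

```python
import numpy as np

def lvgp_correlation(x1, s1, x2, s2, latent, phi):
    """Gaussian correlation over quantitative inputs plus latent source positions.

    x1, x2 : quantitative inputs; s1, s2 : integer source labels;
    latent : (n_sources, latent_dim) array of latent positions (hypothetical values);
    phi    : length scales for the quantitative inputs (hypothetical values).
    """
    dx = np.asarray(x1, dtype=float) - np.asarray(x2, dtype=float)
    dz = latent[s1] - latent[s2]
    return float(np.exp(-np.sum(phi * dx**2) - np.sum(dz**2)))

def source_dissimilarity(latent):
    """Pairwise Euclidean distances between source latent vectors (assumed metric)."""
    diff = latent[:, None, :] - latent[None, :, :]
    return np.sqrt(np.sum(diff**2, axis=-1))

# Hypothetical latent positions for three data sources.
# Sources 0 and 1 sit close together; source 2 sits far from both.
latent = np.array([[0.0, 0.0],
                   [0.1, 0.0],
                   [1.5, 0.8]])
phi = np.array([1.0])

D = source_dissimilarity(latent)

# Same quantitative input, different sources: nearby sources stay highly
# correlated, while a distant source contributes little to the prediction.
r_near = lvgp_correlation([0.5], 0, [0.5], 1, latent, phi)
r_far = lvgp_correlation([0.5], 0, [0.5], 2, latent, phi)
```

Under these assumptions, a small latent distance between two sources means the model treats their observations as nearly interchangeable, whereas a large distance flags a source whose underlying conditions differ, which is the kind of source-level interpretation the abstract refers to.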
