Interpretable Multi-Source Data Fusion Through Latent Variable Gaussian Process

Sandipp Krishnan Ravi, Yigitcan Comlek, Wei Chen, Arjun Pathak, Vipul Gupta, Rajnikant Umretiya, Andrew Hoffman, Ghanshyam Pilania, Piyush Pandita, Sayan Ghosh, Nathaniel Mckeever, Liping Wang

From arXiv: 27 pages, 9 figures, 3 supplementary figures, 2 supplementary tables

With the advent of artificial intelligence (AI) and machine learning (ML), various science and engineering communities have leveraged data-driven surrogates to model complex systems from numerous sources of information (data). This proliferation has led to a significant reduction in the cost and time involved in developing superior systems designed to perform specific functions. A high proportion of such surrogates are built by fusing multiple sources of data, be it published papers, patents, open repositories, or other resources. However, not much attention has been paid to the differences in quality and comprehensiveness of the known and unknown underlying physical parameters of the information sources, which could have downstream implications during system optimization. To resolve this issue, a multi-source data fusion framework based on the Latent Variable Gaussian Process (LVGP) is proposed. The individual data sources are tagged as a characteristic categorical variable that is mapped into a physically interpretable latent space, allowing the development of source-aware data fusion modeling. Additionally, a dissimilarity metric based on the latent variables of LVGP is introduced to study and understand the differences between the sources of data. The proposed approach is demonstrated on and analyzed through two mathematical case studies (a representative parabola problem and the 2D Ackley function) and two materials science case studies (design of FeCrAl and SmCoFe alloys). The case studies show that, compared to single-source and source-unaware ML models, the proposed multi-source data fusion framework can provide better predictions for sparse-data problems, interpretability regarding the sources, and enhanced modeling capabilities by taking advantage of the correlations and relationships among different sources.
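To make the source-as-categorical-variable idea concrete, here is a minimal sketch in Python. It assumes the commonly used LVGP correlation form, a Gaussian kernel over the numerical inputs multiplied by a Gaussian kernel over 2D latent vectors assigned to each source, and it assumes the dissimilarity metric is the Euclidean distance between those latent vectors; the abstract does not state the exact definition. The names `lvgp_kernel`, `source_dissimilarity`, `phi`, and the latent coordinates in `latent` are all hypothetical, and in practice the latent positions would be estimated from data rather than set by hand.

```python
import numpy as np


def lvgp_kernel(x1, s1, x2, s2, latent, phi):
    """Correlation between points (x1, s1) and (x2, s2).

    x1, x2 : (n, d) arrays of numerical inputs
    s1, s2 : integer source labels, one per row of x1 / x2
    latent : (n_sources, 2) array, one latent vector per source (assumed 2D)
    phi    : (d,) array of positive scale weights for the numerical inputs
    """
    z1 = latent[np.asarray(s1)]  # latent vector for each row of x1
    z2 = latent[np.asarray(s2)]  # latent vector for each row of x2
    dx = x1[:, None, :] - x2[None, :, :]  # pairwise input differences
    dz = z1[:, None, :] - z2[None, :, :]  # pairwise latent differences
    return np.exp(-np.sum(phi * dx**2, axis=-1) - np.sum(dz**2, axis=-1))


def source_dissimilarity(latent):
    """Pairwise Euclidean distance between source latent vectors (assumed metric)."""
    diff = latent[:, None, :] - latent[None, :, :]
    return np.linalg.norm(diff, axis=-1)


# Hypothetical latent positions for three data sources; source 0 sits at the origin.
latent = np.array([[0.0, 0.0], [0.1, 0.0], [1.5, 0.5]])
phi = np.array([1.0, 1.0])
x = np.array([[0.2, 0.4]])

same_source = lvgp_kernel(x, [0], x, [0], latent, phi)
far_source = lvgp_kernel(x, [0], x, [2], latent, phi)
D = source_dissimilarity(latent)
```

In this sketch, sources 0 and 1 sit close together in the latent space, so observations from one are strongly correlated with observations from the other, while source 2 sits farther away and contributes less to predictions for source 0.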
