Spatial best linear unbiased prediction: A computational mathematics approach for high dimensional massive datasets

With the advent of massive datasets, much of the computational science and engineering community has moved toward data-intensive approaches to regression and classification. These approaches, however, present significant challenges as the size, complexity, and dimensionality of the problems increase. In particular, covariance matrices are in many cases numerically unstable, and it is well known from numerical analysis that such ill-conditioned matrices often cannot be inverted accurately on a finite-precision computer. A common ad hoc approach to stabilizing a matrix is to apply a so-called nugget. However, this can change the model and introduce error into the original solution. In this paper we develop a multilevel computational method that scales well with the number of observations and dimensions. A multilevel basis is constructed that is adapted to a kD-tree partitioning of the observations. Numerically unstable covariance matrices with large condition numbers can be transformed into well-conditioned multilevel ones without compromising accuracy. Moreover, it is shown that the multilevel prediction exactly solves the Best Linear Unbiased Predictor (BLUP) and Generalized Least Squares (GLS) model while remaining numerically stable. The multilevel method is tested on numerically unstable problems of up to 25 dimensions. Numerical results show speedups of up to 42,050 times for solving the BLUP problem, with the same accuracy as the traditional iterative approach. For very ill-conditioned cases the speedup is infinite. In addition, decay estimates of the multilevel covariance matrices are derived based on high-dimensional interpolation techniques from the field of numerical analysis. This work lies at the intersection of statistics, uncertainty quantification, high-performance computing, and computational applied mathematics.
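
The ill-conditioning and the nugget trade-off described above can be illustrated with a minimal sketch. It is not the multilevel method of this paper. It assumes a squared-exponential covariance on 40 equally spaced one-dimensional locations, test data sin(2πx), and a nugget of 1e-6 added to the diagonal; the function names `gaussian_covariance` and `nugget_demo`, and all parameter values, are hypothetical choices made for illustration only.

```python
import numpy as np


def gaussian_covariance(x, length_scale=1.0):
    """Squared-exponential covariance matrix for 1-D locations x (assumed kernel)."""
    d = x[:, None] - x[None, :]
    return np.exp(-0.5 * (d / length_scale) ** 2)


def nugget_demo(n=40, nugget=1e-6):
    """Compare conditioning with and without a nugget on a toy problem."""
    x = np.linspace(0.0, 1.0, n)
    y = np.sin(2.0 * np.pi * x)               # illustrative observations
    C = gaussian_covariance(x)
    C_nugget = C + nugget * np.eye(n)          # nugget: small multiple of the identity

    cond_raw = np.linalg.cond(C)               # very large: C is nearly singular
    cond_nugget = np.linalg.cond(C_nugget)     # bounded by roughly n / nugget

    # Solving with the stabilized matrix is well posed, but the fitted values
    # under the original covariance no longer reproduce the data exactly:
    # C w - y = -nugget * w, which is nonzero whenever y is nonzero.
    w = np.linalg.solve(C_nugget, y)
    residual = np.linalg.norm(C @ w - y)
    return cond_raw, cond_nugget, residual


if __name__ == "__main__":
    cond_raw, cond_nugget, residual = nugget_demo()
    print(f"condition number without nugget: {cond_raw:.3e}")
    print(f"condition number with nugget:    {cond_nugget:.3e}")
    print(f"misfit introduced by the nugget: {residual:.3e}")
```

In this toy setting the nugget lowers the condition number by many orders of magnitude, but the stabilized system solves a slightly different problem, which is the trade-off a method that preserves the original model is meant to avoid.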
