Forward variable selection enables fast and accurate dynamic system identification with Karhunen-Loève decomposed Gaussian processes

A promising approach for scalable Gaussian processes (GPs) is the Karhunen-Loève (KL) decomposition, in which the GP kernel is represented by a set of basis functions that are the eigenfunctions of the kernel operator. Such decomposed kernels have the potential to be very fast, and they do not depend on the selection of a reduced set of inducing points. However, KL decompositions lead to high dimensionality, so variable selection becomes paramount. This paper reports a new method of forward variable selection, enabled by the ordered nature of the basis functions in the KL expansion of the Bayesian Smoothing Spline ANOVA (BSS-ANOVA) kernel, coupled with fast Gibbs sampling in a fully Bayesian approach. It quickly and effectively limits the number of terms, yielding a method with competitive accuracy and competitive training and inference times for tabular datasets of low feature set dimensionality. The inference speed and accuracy make the method especially useful for dynamic system identification: the dynamics are modeled in the tangent space as a static problem, and the learned dynamics are then integrated using a high-order scheme. The methods are demonstrated on two dynamic datasets: a 'Susceptible, Infected, Recovered' (SIR) toy problem, with the transmissibility used as the forcing function, and the experimental 'Cascaded Tanks' benchmark dataset. For the static prediction of time derivatives, comparisons are made with a random forest (RF), a residual neural network (ResNet), and the Orthogonal Additive Kernel (OAK) inducing points scalable GP; for the timeseries prediction, comparisons are made with LSTM and GRU recurrent neural networks (RNNs) and with the SINDy package.
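The two-stage workflow described above (learn the time derivatives as a static regression problem, then integrate the learned dynamics) can be illustrated with a minimal sketch on an SIR-like toy problem. Everything specific here is an assumption made for illustration and is not taken from the paper: the recovery rate `GAMMA`, the sinusoidal form of the transmissibility `beta`, the function names, and the choice of classical fourth-order Runge-Kutta as one example of a high-order scheme. In particular, the regressor is an ordinary least-squares fit on hand-picked multilinear features, used as a simple stand-in for the BSS-ANOVA GP with forward variable selection, which is not implemented here.

```python
import numpy as np

GAMMA = 0.1  # assumed recovery rate (illustrative value)

def beta(t):
    # Hypothetical forcing function: time-varying transmissibility.
    return 0.3 + 0.1 * np.sin(0.2 * t)

def true_rhs(x, b):
    # Ground-truth SIR dynamics for the state (S, I); R follows from S + I + R = 1.
    s, i = x
    return np.array([-b * s * i, b * s * i - GAMMA * i])

def rk4_rollout(rhs, x0, ts, forcing):
    # Classical fourth-order Runge-Kutta rollout over the time grid ts.
    xs = [np.asarray(x0, dtype=float)]
    for t0, t1 in zip(ts[:-1], ts[1:]):
        h, x = t1 - t0, xs[-1]
        k1 = rhs(x, forcing(t0))
        k2 = rhs(x + 0.5 * h * k1, forcing(t0 + 0.5 * h))
        k3 = rhs(x + 0.5 * h * k2, forcing(t0 + 0.5 * h))
        k4 = rhs(x + h * k3, forcing(t1))
        xs.append(x + h / 6.0 * (k1 + 2.0 * k2 + 2.0 * k3 + k4))
    return np.array(xs)

def features(x, b):
    # Multilinear features of (S, I, beta); a stand-in for learned basis functions.
    s, i = x[..., 0], x[..., 1]
    b = np.broadcast_to(b, s.shape)
    return np.stack([np.ones_like(s), s, i, b, s * i, s * b, i * b, s * i * b], axis=-1)

# Stage 1: build a static regression dataset (state, forcing) -> time derivative.
ts = np.linspace(0.0, 60.0, 601)
inputs, targets = [], []
for i0 in (0.01, 0.02, 0.05, 0.1, 0.2):
    traj = rk4_rollout(true_rhs, [1.0 - i0, i0], ts, beta)
    inputs.append(features(traj, beta(ts)))
    # Derivatives are estimated from the sampled trajectory by finite differences.
    targets.append(np.gradient(traj, ts, axis=0, edge_order=2))
W, *_ = np.linalg.lstsq(np.vstack(inputs), np.vstack(targets), rcond=None)

def learned_rhs(x, b):
    # Learned tangent-space model: predicted (dS/dt, dI/dt).
    return features(np.asarray(x), b) @ W

# Stage 2: integrate the learned dynamics from a held-out initial condition.
x0 = [0.97, 0.03]
ref = rk4_rollout(true_rhs, x0, ts, beta)
pred = rk4_rollout(learned_rhs, x0, ts, beta)
max_err = np.max(np.abs(pred - ref))
print(f"max abs rollout error: {max_err:.2e}")
```

Because the derivative model is static, any tabular regressor can be dropped in for `learned_rhs`; the integrator only needs a function of the current state and forcing value, which is why fast inference matters for the rollout.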
