Functional data analysis (FDA) is widely applied across many fields because data are often recorded continuously over a time interval or at several discrete points. Because the data are observed on a dense grid rather than at every point, smoothing techniques are often employed to convert the observed data into functions. In this work, we propose a novel Bayesian approach for selecting basis functions when smoothing one or multiple curves simultaneously. Our method differs from other Bayesian approaches in two key ways: (i) it accounts for correlated errors, and (ii) it uses a variational EM algorithm instead of a Gibbs sampler. Simulation studies demonstrate that our method effectively identifies the true underlying structure of the data across various scenarios and that it is applicable to different types of functional data. Our variational EM algorithm not only recovers the basis coefficients and the correct set of basis functions but also estimates the within-curve correlation. When applied to the motorcycle dataset, our method achieves comparable, and in some cases superior, adjusted $R^2$ relative to other techniques such as regression splines, Bayesian LASSO, and LASSO. Additionally, when observations within a curve are assumed independent, our method, which then requires only a variational Bayes algorithm, is on average on the order of thousands of times faster than a Gibbs sampler. Our proposed method is implemented in R, and the code is available at https://github.com/acarolcruz/VB-Bases-Selection.