We present a method to approximate Gaussian process regression models for large datasets by considering only a subset of the data. Our approach is novel in that the size of the subset is selected on the fly during exact inference with little computational overhead. From an empirical observation that the log-marginal likelihood often exhibits a linear trend once a sufficient subset of a dataset has been observed, we conclude that many large datasets contain redundant information that only slightly affects the posterior. Based on this, we provide probabilistic bounds on the full model evidence that can identify such subsets. Remarkably, these bounds are largely composed of terms that appear in intermediate steps of the standard Cholesky decomposition, allowing us to modify the algorithm to adaptively stop the decomposition once enough data have been observed.
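To illustrate why the Cholesky decomposition lends itself to early stopping, the following is a minimal sketch, not the method of the paper. It relies on a standard fact: the Cholesky factor of a leading principal submatrix is the leading block of the full factor, so a row-by-row decomposition yields the log-marginal likelihood of every prefix of the data as a running sum. The function names `rbf_kernel` and `prefix_log_evidence`, the squared-exponential kernel, the zero prior mean, and the noise level are all assumptions made for illustration. The probabilistic bounds and the stopping criterion are not specified in the abstract and are therefore not implemented here.

```python
import numpy as np


def rbf_kernel(X, lengthscale=1.0, variance=1.0):
    """Squared-exponential kernel matrix (an assumed kernel choice)."""
    diff = X[:, None, :] - X[None, :, :]
    sq_dist = np.sum(diff ** 2, axis=-1)
    return variance * np.exp(-0.5 * sq_dist / lengthscale ** 2)


def prefix_log_evidence(K, y):
    """Return log p(y[:n]) for n = 1..N under y ~ N(0, K).

    K must already include the observation noise on its diagonal.
    The Cholesky factor is built one row at a time, so the evidence of
    each prefix is available as an intermediate quantity.
    """
    N = len(y)
    L = np.zeros((N, N))      # lower-triangular Cholesky factor
    alpha = np.zeros(N)       # forward-substitution solution of L @ alpha = y
    log_evidence = np.zeros(N)
    running = 0.0
    for i in range(N):
        # Off-diagonal entries of row i.
        for j in range(i):
            L[i, j] = (K[i, j] - L[i, :j] @ L[j, :j]) / L[j, j]
        # Diagonal entry: predictive standard deviation of y[i] given y[:i].
        L[i, i] = np.sqrt(K[i, i] - L[i, :i] @ L[i, :i])
        # Standardised prediction residual for y[i].
        alpha[i] = (y[i] - L[i, :i] @ alpha[:i]) / L[i, i]
        # Contribution of observation i to the log-marginal likelihood.
        running += -0.5 * alpha[i] ** 2 - np.log(L[i, i]) - 0.5 * np.log(2.0 * np.pi)
        log_evidence[i] = running
        # A stopping rule based on bounds of the full evidence would be
        # checked here; it is omitted because the abstract does not state it.
    return log_evidence


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.uniform(-3.0, 3.0, size=(200, 1))
    y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(200)
    K = rbf_kernel(X) + 0.1 ** 2 * np.eye(200)
    curve = prefix_log_evidence(K, y)
    # Increments of the curve; a roughly constant increment corresponds
    # to the linear trend described in the abstract.
    print(np.diff(curve)[-5:])
```

Each increment of the returned curve is the log predictive density of one observation given all earlier ones, which is why a near-constant increment indicates that additional data barely change the posterior. The pure-Python double loop is chosen for readability; a practical implementation would process blocks of rows with a blocked Cholesky routine.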