Stabilised weighted data subsampling for accelerated inference in models with recursive likelihoods

From arXiv, Version 2: Revised and shortened for journal submission. Some technical material has been moved from the main paper to the appendix and supplementary material. Minor improvements to exposition and presentation. No substantive changes to the methodology, theoretical results, or conclusions. This version includes the main manuscript, appendix, and supplementary material in a single file.

Inference for models with recursively defined likelihoods is computationally demanding, limiting scalability to large datasets. We propose a stabilised weighted subsampling methodology for accelerated inference based on an unbiased estimator of the log-likelihood. By assigning higher sampling probabilities to early observations, the method reduces the effective depth of recursive likelihood evaluations and hence computational cost. However, sampling probabilities that decay too slowly yield limited savings, while overly aggressive decay can substantially inflate estimator variance. We develop a stabilisation framework, supported by theory, that restricts the decay to avoid both computational and variance pathologies through principled hyperparameter tuning. We also derive an unbiased subsampling estimator of the log-likelihood gradient, enabling gradient-based inference. The methodology can be embedded within a range of inferential frameworks. We illustrate its use in variational Bayes and subsampling Markov chain Monte Carlo for conditional volatility models, including leverage effects. Empirical results show substantial computational speed-ups relative to full-data methods while maintaining inferential accuracy. We also compare with recent stochastic gradient MCMC and divide-and-conquer MCMC methods for temporally dependent data, observing favourable empirical performance.
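
To make the idea of weighted subsampling with an unbiased log-likelihood estimator concrete, the following is a minimal sketch and not the authors' implementation. It assumes a Gaussian GARCH(1,1) model as the example conditional volatility model, sampling with replacement, and an importance-weighted estimator of the log-likelihood sum. The geometric decay mixed with a uniform floor is a hypothetical stand-in for the paper's stabilisation framework, and all names (`garch_loglik_terms`, `sampling_probs`, `subsampled_loglik`, `decay`, `floor`) are illustrative.

```python
import numpy as np


def garch_loglik_terms(y, omega, alpha, beta, upto):
    """Gaussian GARCH(1,1) log-density terms for observations 0..upto-1.

    The variance recursion is only run to depth `upto`, which is where a
    shallow subsample saves work. Initialising at the unconditional variance
    is an assumption of this sketch and requires alpha + beta < 1.
    """
    sigma2 = np.empty(upto)
    sigma2[0] = omega / (1.0 - alpha - beta)
    for t in range(1, upto):
        sigma2[t] = omega + alpha * y[t - 1] ** 2 + beta * sigma2[t - 1]
    return -0.5 * (np.log(2.0 * np.pi) + np.log(sigma2) + y[:upto] ** 2 / sigma2)


def sampling_probs(n, decay, floor):
    """Sampling probabilities that favour early observations.

    Geometric weights decay**t are mixed with a uniform component so that no
    probability falls below floor / n (hypothetical stabilisation choice).
    """
    w = decay ** np.arange(n)
    w = w / w.sum()
    return (1.0 - floor) * w + floor / n


def subsampled_loglik(y, params, probs, m, rng):
    """Unbiased estimate of the full log-likelihood from m weighted draws.

    Returns the estimate and the recursion depth that was actually needed.
    """
    idx = rng.choice(len(y), size=m, replace=True, p=probs)
    depth = int(idx.max()) + 1  # deepest sampled observation
    terms = garch_loglik_terms(y, *params, upto=depth)
    # Each draw contributes l_t / p_t, whose expectation is the full sum.
    return float(np.mean(terms[idx] / probs[idx])), depth
```

Under these assumptions, varying `decay` and `floor` shows the trade-off described in the abstract: a small `floor` with fast decay keeps the sampled depth shallow but makes the weights `1 / p_t` for late observations very large, while a large `floor` keeps the variance in check at the cost of deeper recursions.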
