Bayesian coresets speed up posterior inference in the large-scale data regime by approximating the full-data log-likelihood function with a surrogate log-likelihood based on a small, weighted subset of the data. But while Bayesian coresets and their construction methods are applicable to a wide range of models, existing theoretical analyses of the posterior inferential error incurred by coreset approximations apply only in restrictive settings, namely exponential family models or models with strong log-concavity and smoothness assumptions. This work presents general upper and lower bounds on the Kullback-Leibler (KL) divergence of coreset approximations that reflect the full range of applicability of Bayesian coresets. The lower bounds require only mild model assumptions typical of Bayesian asymptotic analyses, while the upper bounds require the log-likelihood functions to satisfy a generalized subexponentiality criterion that is weaker than the conditions used in earlier work. The lower bounds are applied to obtain fundamental limitations on the quality of coreset approximations, and to provide a theoretical explanation for the previously observed poor empirical performance of importance sampling-based construction methods. The upper bounds are used to analyze the performance of recent subsample-optimize methods. The flexibility of the theory is demonstrated in validation experiments involving multimodal, unidentifiable, and heavy-tailed Bayesian posterior distributions.
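
For orientation, the coreset approximation described above can be written out as follows. This is a minimal sketch in standard Bayesian coreset notation, not taken from the abstract itself: the symbols $f_n$, $w_n$, $\pi_0$, $\pi$, $\pi_w$, $N$, and $M$ are assumptions introduced here for illustration, and the direction of the KL divergence shown is likewise an assumption.

```latex
% Assumed notation: f_n(\theta) = \log p(x_n \mid \theta) is the log-likelihood
% of the n-th data point, \pi_0 is the prior, and w = (w_1, \dots, w_N) are
% nonnegative coreset weights with at most M \ll N nonzero entries.
\begin{align*}
  \pi(\theta)   &\propto \exp\!\Big( \sum_{n=1}^{N} f_n(\theta) \Big)\, \pi_0(\theta)
    && \text{(full-data posterior)} \\
  \pi_w(\theta) &\propto \exp\!\Big( \sum_{n=1}^{N} w_n f_n(\theta) \Big)\, \pi_0(\theta)
    && \text{(coreset posterior)} \\
  \mathrm{KL}(\pi_w \,\|\, \pi) &= \int \pi_w(\theta) \log \frac{\pi_w(\theta)}{\pi(\theta)} \,\mathrm{d}\theta
    && \text{(approximation error being bounded)}
\end{align*}
```

Under this notation, setting $w_n = 1$ for all $n$ recovers the full-data posterior exactly, so the KL divergence is zero; the bounds concern how small it can be made when only a few weights are nonzero.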