There is a rich literature on Bayesian methods for density estimation that characterize the unknown density as a mixture of kernels. Such methods provide uncertainty quantification in estimation while adapting to a rich variety of densities. However, relative to frequentist locally adaptive kernel methods, Bayesian approaches can be slow and unstable to implement because they rely on Markov chain Monte Carlo algorithms. To maintain most of the strengths of Bayesian approaches without the computational disadvantages, we propose a class of nearest neighbor-Dirichlet mixtures. The approach starts by grouping the data into neighborhoods using standard algorithms. Within each neighborhood, the density is characterized via a Bayesian parametric model, such as a Gaussian with unknown parameters. Assigning a Dirichlet prior to the weights on these local kernels, we obtain a pseudo-posterior for the weights and kernel parameters. We propose a simple and embarrassingly parallel Monte Carlo algorithm to sample from the resulting pseudo-posterior for the unknown density. We establish desirable asymptotic properties, evaluate the methods in simulation studies, and apply them to a motivating data set in the context of classification.
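To make the construction concrete, the following is a minimal univariate sketch of the steps described above: form a k-nearest-neighbor neighborhood around each observation, fit a Gaussian with unknown mean and variance in each neighborhood, and draw mixture weights from a Dirichlet distribution. Everything specific here is an assumption made for illustration and is not taken from the text: the function name `nn_dirichlet_density_draws`, the brute-force neighbor search, the conjugate normal-inverse-gamma prior with hyperparameters `mu0`, `kappa0`, `a0`, `b0`, and the symmetric Dirichlet(`alpha` + 1) form used for the weight pseudo-posterior.

```python
import numpy as np


def nn_dirichlet_density_draws(x, grid, k=10, n_draws=200, alpha=1.0,
                               mu0=0.0, kappa0=1.0, a0=2.0, b0=1.0, seed=0):
    """Draw density curves on `grid` from an illustrative pseudo-posterior.

    All hyperparameters and the conjugate prior are assumptions of this sketch.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    grid = np.asarray(grid, dtype=float)
    n = x.size

    # Step 1: neighborhoods. For each point, take its k nearest points
    # (including itself) by brute-force distance in one dimension.
    dist = np.abs(x[:, None] - x[None, :])
    nbrs = x[np.argsort(dist, axis=1)[:, :k]]          # shape (n, k)

    # Step 2: conjugate normal-inverse-gamma update within each neighborhood.
    xbar = nbrs.mean(axis=1)
    ss = ((nbrs - xbar[:, None]) ** 2).sum(axis=1)
    kappa_n = kappa0 + k
    mu_n = (kappa0 * mu0 + k * xbar) / kappa_n
    a_n = a0 + 0.5 * k
    b_n = b0 + 0.5 * ss + 0.5 * kappa0 * k * (xbar - mu0) ** 2 / kappa_n

    # Step 3: independent Monte Carlo draws. No Markov chain is involved,
    # so iterations of this loop could be run in parallel.
    draws = np.empty((n_draws, grid.size))
    for t in range(n_draws):
        # Inverse-gamma variance draw via the reciprocal of a gamma draw.
        sigma2 = 1.0 / rng.gamma(shape=a_n, scale=1.0 / b_n)
        mu = rng.normal(mu_n, np.sqrt(sigma2 / kappa_n))
        # Assumed symmetric Dirichlet pseudo-posterior for the weights.
        w = rng.dirichlet(np.full(n, alpha + 1.0))
        z = (grid[None, :] - mu[:, None]) / np.sqrt(sigma2)[:, None]
        kern = np.exp(-0.5 * z ** 2) / np.sqrt(2.0 * np.pi * sigma2)[:, None]
        draws[t] = w @ kern                            # mixture density on grid
    return draws


# Usage on simulated standard normal data (illustrative only).
data = np.random.default_rng(1).normal(size=200)
grid = np.linspace(-6.0, 6.0, 241)
draws = nn_dirichlet_density_draws(data, grid)
density_mean = draws.mean(axis=0)                       # pointwise estimate
lower, upper = np.quantile(draws, [0.025, 0.975], axis=0)  # pointwise band
```

Because each draw is generated independently of the others, the pointwise mean and quantiles across draws give an estimate and an uncertainty band without any burn-in or convergence diagnostics. In higher dimensions one would replace the brute-force search with a standard nearest-neighbor routine and the univariate Gaussian with a multivariate one.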