In this article we perform an asymptotic analysis of parallel Bayesian logspline density estimators. Such estimators are useful for analyzing datasets that are partitioned into subsets and stored in separate databases, so that the full dataset cannot be accessed from a single computer. The parallel estimator we introduce is in the spirit of a kernel density estimator introduced in recent studies. We provide a numerical procedure that produces the normalized density estimator itself rather than a sampling algorithm. We then derive a bound on the mean integrated squared error of the full-dataset posterior estimator. The error bound depends on the parameters that arise in logspline density estimation and in the numerical approximation procedure. In our analysis, we identify the parameter choices for which the error bound scales optimally with the number of samples. These choices improve the estimation accuracy of the method while keeping its computational cost to a minimum.
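The idea of returning a normalized density rather than samples can be illustrated with a minimal sketch. It assumes, in the style of the kernel-based parallel estimators mentioned above, that the full-dataset estimate is proportional to the pointwise product of the subset density estimates; this is an assumption made for illustration, not a statement of the article's actual procedure. The function names, the grid, the trapezoidal rule, and the Gaussian stand-ins for fitted logspline estimates are all hypothetical.

```python
import math


def normalized_product(log_densities, grid):
    """Multiply subset density estimates pointwise and renormalize on a grid.

    log_densities: list of callables, each returning the log of one subset
                   density estimate at a point (hypothetical stand-ins for
                   fitted logspline estimates).
    grid: increasing list of evaluation points covering the support.
    """
    # Summing logs gives the log of the product and avoids underflow.
    log_prod = [sum(f(x) for f in log_densities) for x in grid]
    shift = max(log_prod)
    unnormalized = [math.exp(v - shift) for v in log_prod]

    # Normalizing constant by the trapezoidal rule (an illustrative choice).
    z = sum(
        0.5 * (unnormalized[i] + unnormalized[i + 1]) * (grid[i + 1] - grid[i])
        for i in range(len(grid) - 1)
    )
    return [u / z for u in unnormalized]


def gaussian_log_pdf(mean, sd):
    """Return the log-density of N(mean, sd^2), used here as a toy subset estimate."""
    def log_pdf(x):
        return -0.5 * ((x - mean) / sd) ** 2 - math.log(sd * math.sqrt(2.0 * math.pi))
    return log_pdf


# Three toy subset estimates with slightly different centers.
grid = [-6.0 + 12.0 * i / 2000 for i in range(2001)]
subset_estimates = [
    gaussian_log_pdf(0.2, 1.0),
    gaussian_log_pdf(-0.1, 1.0),
    gaussian_log_pdf(0.05, 1.0),
]
density = normalized_product(subset_estimates, grid)
```

In this toy setting the product of three unit-variance Gaussians is itself Gaussian, centered at the average of the three means, so the combined estimate is narrower than any single subset estimate. In practice the grid resolution and quadrature rule would be among the numerical parameters that enter an error bound of the kind described above.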