Bayesian nonparametric boundary detection for multiple areal data

We consider the problem of boundary detection for areal data, focusing on situations where multiple observations are available for each areal unit. We propose a Bayesian nonparametric mixture model for the area-specific population densities, with spatially dependent weights and a random number of components. Unlike previously proposed boundary detection methods, which consider one observation per areal unit, ours does not require external information such as area-specific covariates or dissimilarity metrics. Instead, by exploiting information from multiple samples per area, it identifies boundaries between areas that exhibit different densities. Crucially, the number of mixture components must be learned from the data to obtain meaningful boundary detection, because overfitted mixtures are non-identifiable. We therefore treat it as random and place a prior on it. The motivating application is the analysis of economic inequality in the greater Los Angeles region; such inequality typically leads to social inequality and unrest. Efficient posterior computation is facilitated by a transdimensional Markov chain Monte Carlo sampler that exploits the recently introduced \emph{optimal auxiliary priors} to improve mixing. The methodology is validated via extensive simulations and applied to income data from the greater Los Angeles region. We identify several boundaries in the income distributions, which can be explained \textit{ex post} in terms of the percentage of the population without health insurance, though not in terms of the total number of crimes, showing the usefulness of such an analysis to policymakers.
