Compositional data, which represent proportions constrained to the simplex, arise in diverse fields such as geosciences, ecology, genomics, and microbiome research. Existing nonparametric density estimation methods often rely on transformations, which may induce substantial bias near the simplex boundary. We propose a nonparametric mixture-based framework for density estimation on compositions. Nonparametric Dirichlet mixtures accommodate boundary values naturally, so no transformation or zero replacement is needed. They also identify components supported on the boundary, which yields reliable estimates for data with zero or near-zero values. We also address bandwidth selection and initialization schemes. For comparison, nonparametric Gaussian mixtures coupled with log-ratio transformations are also considered. Extensive simulations show that the proposed estimators outperform existing approaches. Three real data applications (GDP data analysis, handwritten digit recognition, and skin detection) demonstrate the usefulness of nonparametric Dirichlet mixtures in practice.
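To illustrate why a Dirichlet-based estimate can be evaluated directly at compositions containing zeros, the following minimal sketch builds an equal-weight mixture with one Dirichlet component per observation. It is a deliberate simplification and not the estimator proposed here: the parameterization `alpha = x_i / h + 1`, the equal weights, and the function names `dirichlet_logpdf` and `dirichlet_mixture_density` are all assumptions made for illustration, and the bandwidth `h` is fixed by hand rather than selected from the data.

```python
import math


def dirichlet_logpdf(x, alpha):
    """Log density of Dirichlet(alpha) at a composition x (parts sum to 1)."""
    log_norm = math.lgamma(sum(alpha)) - sum(math.lgamma(a) for a in alpha)
    log_kernel = 0.0
    for xj, aj in zip(x, alpha):
        if aj == 1.0:
            continue  # x_j ** 0 == 1, also when x_j is exactly zero
        if xj == 0.0:
            # Only alpha_j > 1 occurs in this sketch, so the density vanishes.
            return float("-inf")
        log_kernel += (aj - 1.0) * math.log(xj)
    return log_norm + log_kernel


def dirichlet_mixture_density(x, data, h):
    """Equal-weight mixture of Dirichlet(x_i / h + 1) components (assumed form)."""
    total = 0.0
    for xi in data:
        alpha = [xij / h + 1.0 for xij in xi]  # every alpha_j >= 1
        total += math.exp(dirichlet_logpdf(x, alpha))
    return total / len(data)


# Two-part compositions; the first observation lies on the boundary.
data = [(0.0, 1.0), (0.4, 0.6)]
boundary_value = dirichlet_mixture_density((0.0, 1.0), data, h=0.1)
interior_value = dirichlet_mixture_density((0.3, 0.7), data, h=0.1)
```

No log-ratio transformation or zero replacement is applied to `data`, and the estimate stays finite at the boundary point `(0.0, 1.0)`. A smaller `h` concentrates each component more tightly around its observation, which is why the choice of bandwidth matters.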