We introduce a novel approach, based on statistical significance, for clustering hierarchical data using semi-parametric linear mixed-effects models designed for responses whose distribution belongs to the exponential family (e.g., Poisson and Bernoulli). Within the family of semi-parametric mixed-effects models, a latent clustering structure of the highest-level units can be identified by assuming that the random effects follow a discrete distribution with an unknown number of support points. We identify this structure by computing $\alpha$-level confidence regions for the estimated support points and determining which clusters are statistically different. At each iteration of a tailored Expectation-Maximization algorithm, the two closest estimated support points whose confidence regions overlap are collapsed into one. Unlike related state-of-the-art methods, which rely on arbitrary thresholds to decide when close discrete masses are merged, the proposed approach relies on conventional statistical confidence levels, thereby avoiding discretionary tuning parameters. To demonstrate the effectiveness of our approach, we apply it to data from the Programme for International Student Assessment (PISA - OECD) to cluster countries based on innumeracy rates in schools. Additionally, we provide and discuss a simulation study and a comparison with classical parametric and state-of-the-art models.
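The collapse step described above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes one-dimensional support points, normal-approximation confidence intervals built from standard errors supplied by the caller, and a weight-averaged merge rule. The function names (`confidence_interval`, `collapse_step`) and all of these choices are hypothetical and are introduced only for illustration.

```python
from itertools import combinations
from statistics import NormalDist


def confidence_interval(point, std_err, alpha=0.05):
    """Normal-approximation interval (an assumption of this sketch)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return point - z * std_err, point + z * std_err


def collapse_step(points, std_errs, weights, alpha=0.05):
    """Merge the closest pair of support points whose intervals overlap.

    Returns (new_points, new_weights), or None when no pair overlaps.
    The weight-averaged merge rule is a hypothetical choice.
    """
    best = None  # (distance, i, j) of the closest overlapping pair so far
    for i, j in combinations(range(len(points)), 2):
        lo_i, hi_i = confidence_interval(points[i], std_errs[i], alpha)
        lo_j, hi_j = confidence_interval(points[j], std_errs[j], alpha)
        overlap = lo_i <= hi_j and lo_j <= hi_i
        distance = abs(points[i] - points[j])
        if overlap and (best is None or distance < best[0]):
            best = (distance, i, j)
    if best is None:
        return None

    _, i, j = best
    merged_weight = weights[i] + weights[j]
    merged_point = (
        weights[i] * points[i] + weights[j] * points[j]
    ) / merged_weight

    # Keep the untouched support points and append the merged one.
    keep = [k for k in range(len(points)) if k not in (i, j)]
    new_points = [points[k] for k in keep] + [merged_point]
    new_weights = [weights[k] for k in keep] + [merged_weight]
    return new_points, new_weights
```

In a full procedure, the standard errors of the remaining support points would be re-estimated in the next iteration rather than carried over, which is why this sketch does not return them.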