Latent class models are widely used for identifying unobserved subgroups from multivariate categorical data in the social sciences, with binary data as a particularly popular example. However, accurately recovering individual latent class memberships remains challenging, especially when handling high-dimensional datasets with many items. This work proposes a novel two-stage algorithm for latent class models suited to high-dimensional binary responses. Our method first initializes latent class assignments with an easy-to-implement spectral clustering algorithm, and then refines these assignments with a one-step likelihood-based update. This approach combines the computational efficiency of spectral clustering with the improved statistical accuracy of likelihood-based estimation. We establish theoretical guarantees showing that this method is minimax-optimal for latent class recovery in the statistical decision theory sense. The method also leads to exact clustering of subjects with high probability under mild conditions. As a byproduct, we propose a computationally efficient and consistent estimator for the number of latent classes. Extensive experiments on both simulated and real data validate our theoretical results and demonstrate our method's superior performance over alternative methods.
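To make the two-stage idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes an N x J binary response matrix `R` and a known number of classes `K`. Stage one runs k-means on the top-K singular-vector embedding of `R`. Stage two estimates item parameters from the initial clusters and reassigns each subject to the class that maximizes its Bernoulli log-likelihood. The function names (`spectral_init`, `one_step_refine`, `two_stage_cluster`), the farthest-point k-means seeding, the clipping constant `eps`, and the omission of class proportions from the update are all assumptions made for illustration.

```python
import numpy as np


def spectral_init(R, K, n_iter=50):
    """Stage 1 (assumed form): k-means on the top-K SVD embedding of R."""
    U, s, _ = np.linalg.svd(R, full_matrices=False)
    X = U[:, :K] * s[:K]  # N x K embedding of the subjects

    # Deterministic farthest-point seeding (an illustrative choice).
    centers = [X[0]]
    for _ in range(1, K):
        d = np.min([((X - c) ** 2).sum(axis=1) for c in centers], axis=0)
        centers.append(X[d.argmax()])
    centers = np.array(centers)

    # Plain Lloyd iterations.
    for _ in range(n_iter):
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        labels = dist.argmin(axis=1)
        for k in range(K):
            if np.any(labels == k):
                centers[k] = X[labels == k].mean(axis=0)
    return labels


def one_step_refine(R, labels, K, eps=1e-3):
    """Stage 2 (assumed form): one likelihood-based reassignment step."""
    J = R.shape[1]
    # Item parameters: per-class mean response, 0.5 for an empty class.
    theta = np.vstack([
        R[labels == k].mean(axis=0) if np.any(labels == k) else np.full(J, 0.5)
        for k in range(K)
    ])
    theta = np.clip(theta, eps, 1 - eps)  # keep the logarithms finite
    # N x K matrix of Bernoulli log-likelihoods, one column per class.
    loglik = R @ np.log(theta).T + (1 - R) @ np.log(1 - theta).T
    return loglik.argmax(axis=1)


def two_stage_cluster(R, K):
    """Spectral initialization followed by a single refinement step."""
    R = np.asarray(R, dtype=float)
    return one_step_refine(R, spectral_init(R, K), K)
```

Calling `two_stage_cluster(R, K)` returns one integer label per subject. Labels are identified only up to a permutation of the classes, so comparisons against known memberships should account for relabeling.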