Zero-inflated count data arise in various fields, including health, biology, economics, and the social sciences. These data are often modelled using probability distributions such as the zero-inflated Poisson (ZIP), zero-inflated negative binomial (ZINB), or zero-inflated binomial (ZIB). To account for heterogeneity in the data, it is often useful to cluster observations into groups that may explain underlying differences in the data-generating process. This paper focuses on model-based clustering for zero-inflated counts when the observations are arranged as a matrix rather than a vector. We propose a clustering framework based on mixtures of ZIP or ZINB distributions, in which both the count component and the zero-inflation component depend on cluster assignment. Our approach incorporates covariates through a log-linear structure for the mean parameter and includes a size factor to adjust for differences in total sampling or exposure. Model parameters and cluster assignments are estimated via the Expectation-Maximization (EM) algorithm. We assess the proposed methodology through simulation studies that evaluate clustering accuracy and the properties of the estimators, and then apply it to publicly available datasets.
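
To make the ingredients named above concrete, the following is a minimal sketch of a ZIP mixture E-step, not the paper's implementation. It assumes that rows of the count matrix are the units being clustered, that each row has one size factor and one covariate vector shared across its columns, and that each cluster has a single zero-inflation probability and a single coefficient vector. The function names (`zip_log_pmf`, `poisson_mean`, `responsibilities`) and this parameterization are illustrative assumptions; the abstract does not specify them.

```python
import math


def zip_log_pmf(y, pi, lam):
    """Log-probability of count y under a zero-inflated Poisson.

    pi  : probability of a structural zero (assumed 0 <= pi < 1)
    lam : Poisson mean (assumed > 0)
    """
    if y == 0:
        # A zero is either a structural zero or a Poisson zero.
        return math.log(pi + (1.0 - pi) * math.exp(-lam))
    return math.log1p(-pi) - lam + y * math.log(lam) - math.lgamma(y + 1)


def poisson_mean(size_factor, x, beta):
    """Log-linear mean with an offset: log(lam) = log(size_factor) + x . beta."""
    return size_factor * math.exp(sum(xj * bj for xj, bj in zip(x, beta)))


def responsibilities(row, size_factor, x, weights, pis, betas):
    """E-step for one row: posterior probability of each cluster.

    weights, pis, betas hold one mixing weight, one zero-inflation
    probability, and one coefficient vector per cluster (an assumed,
    simplified parameterization).
    """
    log_post = []
    for w, pi, beta in zip(weights, pis, betas):
        lam = poisson_mean(size_factor, x, beta)
        log_post.append(math.log(w) + sum(zip_log_pmf(y, pi, lam) for y in row))
    # Normalize on the log scale to avoid underflow for long rows.
    m = max(log_post)
    unnorm = [math.exp(v - m) for v in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]
```

In a full EM fit, these responsibilities would weight the M-step updates of the mixing weights, zero-inflation probabilities, and regression coefficients; a ZINB variant would replace the Poisson term with a negative binomial one and add a dispersion parameter.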