We consider the problem of inferring an unknown number of clusters in replicated multinomial data. From a model-based clustering point of view, this task can be addressed by estimating finite mixtures of multinomial distributions, with or without covariates. Both Maximum Likelihood (ML) and Bayesian estimation are considered. Under the ML approach, we provide an Expectation--Maximization (EM) algorithm that combines a careful initialization procedure with a ridge--stabilized implementation of the Newton--Raphson method in the M--step. Under the Bayesian setup, we devise a stochastic gradient Markov chain Monte Carlo (MCMC) algorithm embedded within a prior parallel tempering scheme. The number of clusters is selected according to the Integrated Completed Likelihood (ICL) criterion in the ML approach, and by estimating the number of non-empty components in overfitting mixture models in the Bayesian case. Our method is illustrated on simulated data and applied to two real datasets. An R package is available at https://github.com/mqbssppe/multinomialLogitMix.
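To make the ML route concrete, the following is a minimal sketch of the simplest case only: a multinomial mixture without covariates, fitted by EM, with the number of clusters chosen by a BIC-type approximation of the Integrated Completed Likelihood. It is not the implementation in the multinomialLogitMix package. The function names (`farthest_point_init`, `fit_em`, `icl`, `select_k`), the farthest-point initialization (a simple stand-in for the careful initialization procedure mentioned above), the smoothing constant `eps`, and the "larger is better" sign convention are all assumptions made for illustration. With no covariates the M-step is available in closed form, so the ridge-stabilized Newton-Raphson step is not needed here.

```python
import math

def farthest_point_init(X, K):
    """Assumed initialization: pick K rows with well-separated proportion vectors.
    Every row of X is assumed to have a positive total count."""
    props = [[xj / sum(x) for xj in x] for x in X]
    chosen = [0]
    while len(chosen) < K:
        def gap(i):
            return min(sum((a - b) ** 2 for a, b in zip(props[i], props[c]))
                       for c in chosen)
        chosen.append(max(range(len(X)), key=gap))
    # Add-one smoothing keeps every starting probability strictly positive.
    return [[(xj + 1.0) / (sum(X[c]) + len(X[c])) for xj in X[c]] for c in chosen]

def e_step(X, pi, theta):
    """Responsibilities and observed-data log-likelihood. The multinomial
    coefficient is omitted because it does not depend on the parameters or K."""
    resp, loglik = [], 0.0
    for x in X:
        logp = [math.log(pk) + sum(xj * math.log(tj) for xj, tj in zip(x, tk))
                for pk, tk in zip(pi, theta)]
        m = max(logp)  # log-sum-exp trick for numerical stability
        w = [math.exp(l - m) for l in logp]
        s = sum(w)
        resp.append([wk / s for wk in w])
        loglik += m + math.log(s)
    return resp, loglik

def m_step(X, resp, K, eps=1e-8):
    """Closed-form updates; eps is an assumed guard against log(0)."""
    n, J = len(X), len(X[0])
    pi, theta = [], []
    for k in range(K):
        nk = sum(r[k] for r in resp)
        counts = [sum(r[k] * x[j] for r, x in zip(resp, X)) + eps for j in range(J)]
        total = sum(counts)
        pi.append((nk + eps) / (n + K * eps))
        theta.append([c / total for c in counts])
    return pi, theta

def fit_em(X, K, max_iter=500, tol=1e-8):
    theta = farthest_point_init(X, K)
    pi = [1.0 / K] * K
    prev = -math.inf
    for _ in range(max_iter):
        resp, loglik = e_step(X, pi, theta)
        if loglik - prev < tol:
            break
        prev = loglik
        pi, theta = m_step(X, resp, K)
    resp, _ = e_step(X, pi, theta)  # responsibilities matching the final parameters
    return pi, theta, resp

def icl(X, pi, theta, resp):
    """BIC-type ICL approximation: completed log-likelihood under hard (MAP)
    assignments minus (d / 2) * log(n). Larger is better in this convention."""
    n, J, K = len(X), len(X[0]), len(pi)
    labels = [max(range(K), key=lambda k: r[k]) for r in resp]
    complete = sum(
        math.log(pi[z]) + sum(xj * math.log(theta[z][j]) for j, xj in enumerate(x))
        for x, z in zip(X, labels)
    )
    n_params = (K - 1) + K * (J - 1)  # free mixing weights plus free probabilities
    return complete - 0.5 * n_params * math.log(n), labels

def select_k(X, k_max):
    """Fit K = 1..k_max and return the K with the largest ICL, plus all scores."""
    scores = {}
    for K in range(1, k_max + 1):
        pi, theta, resp = fit_em(X, K)
        scores[K], _ = icl(X, pi, theta, resp)
    return max(scores, key=scores.get), scores
```

Here `X` is a list of count vectors, one per replicate, and `select_k(X, k_max)` returns the selected number of clusters together with the score for each candidate K. Adding covariates would replace the closed-form update of the mixing weights with a multinomial logit regression, which is where an iterative scheme such as Newton-Raphson enters the M-step.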