Finite mixture models are widely used in econometric analyses to capture unobserved heterogeneity. This paper shows that maximum likelihood estimation of finite mixtures of parametric densities can suffer from substantial finite-sample bias in all parameters under mild regularity conditions. The bias arises from the influence of outliers in component densities with unbounded or large support and increases with the degree of overlap among mixture components. I show that maximizing the classification-mixture likelihood function, equipped with a consistent classifier, yields parameter estimates that are less biased than those obtained by standard maximum likelihood estimation (MLE). I then derive the asymptotic distribution of the resulting estimator and provide conditions under which oracle efficiency is achieved. Monte Carlo simulations show that conventional mixture MLE exhibits pronounced finite-sample bias, which diminishes as the sample size or the statistical distance between component densities tends to infinity. The simulations further show that, under relatively weak assumptions, the proposed estimation strategy generally outperforms standard MLE in finite samples in terms of both bias and mean squared error. An empirical application to latent group panel structures using health administrative data shows that the proposed approach reduces out-of-sample prediction error by approximately 17.6% relative to the best results obtained from standard MLE procedures.
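For orientation, the two kinds of objective contrasted above can be written in their textbook forms. This is only a sketch under assumed notation: the symbols $y_i$, $K$, $\pi_k$, $f(\cdot;\theta_k)$, and the classifier output $\hat z_i$ are not taken from the abstract, and the paper's own classification-mixture likelihood may be defined differently from the classical classification likelihood shown here.

```latex
% Assumed notation (not from the abstract): sample y_1,...,y_n, K components,
% mixing weights \pi_k, component densities f(\cdot;\theta_k),
% and a classifier output \hat z_i \in \{1,\dots,K\}.

% Standard mixture log-likelihood (maximized by conventional mixture MLE):
\ell_{M}(\pi,\theta) \;=\; \sum_{i=1}^{n} \log \sum_{k=1}^{K} \pi_k \, f(y_i;\theta_k)

% Classical classification log-likelihood given labels \hat z_i:
\ell_{C}(\pi,\theta) \;=\; \sum_{i=1}^{n} \sum_{k=1}^{K}
  \mathbf{1}\{\hat z_i = k\}\,\bigl[\log \pi_k + \log f(y_i;\theta_k)\bigr]
```

In $\ell_{M}$ every observation contributes to every component's parameters, so an extreme draw from one component can pull the estimates of another. In $\ell_{C}$ each $\theta_k$ is fitted only on the observations assigned to component $k$. If the assigned labels coincided with the true memberships, maximizing $\ell_{C}$ would reduce to component-by-component MLE with known memberships, which is the infeasible oracle benchmark.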