Missing data remains a common problem in large datasets, including survey and census data containing many ordinal responses, such as political polls and opinion surveys. Multiple imputation (MI) is usually the go-to approach for analyzing such incomplete datasets, and several implementations of MI exist, including methods based on generalized linear models, tree-based models, and Bayesian non-parametric models. However, there is limited research on the statistical performance of these methods for multivariate ordinal data. In this article, we empirically evaluate several MI methods: MI by chained equations (MICE) using multinomial logistic regression models, MICE using proportional odds logistic regression models, MICE using classification and regression trees, MICE using random forests, MI using Dirichlet process (DP) mixtures of products of multinomial distributions, and MI using DP mixtures of multivariate normal distributions. We evaluate the methods using simulation studies based on ordinal variables selected from the 2018 American Community Survey (ACS). Under our simulation settings, the results suggest that MI using proportional odds logistic regression models, classification and regression trees, and DP mixtures of products of multinomial distributions generally outperforms the other methods. In certain settings, MI using multinomial logistic regression models achieves comparable performance, depending on the missing data mechanism and the amount of missing data.