A Comparative Study of Imputation Methods for Multivariate Ordinal Data

From arXiv. This is a pre-copyedited, author-produced version of an article accepted for publication in the Journal of Survey Statistics and Methodology following peer review.

Missing data remain a common problem in large datasets, including survey and census data that contain many ordinal responses, such as political polls and opinion surveys. Multiple imputation (MI) is usually the go-to approach for analyzing such incomplete datasets, and several implementations of MI exist, including methods based on generalized linear models, tree-based models, and Bayesian non-parametric models. However, there is limited research on the statistical performance of these methods for multivariate ordinal data. In this article, we empirically evaluate six MI methods: MI by chained equations (MICE) using multinomial logistic regression models, MICE using proportional odds logistic regression models, MICE using classification and regression trees, MICE using random forests, MI using Dirichlet process (DP) mixtures of products of multinomial distributions, and MI using DP mixtures of multivariate normal distributions. We evaluate the methods using simulation studies based on ordinal variables selected from the 2018 American Community Survey (ACS). Under our simulation settings, the results suggest that MI using proportional odds logistic regression models, classification and regression trees, and DP mixtures of products of multinomial distributions generally outperforms the other methods. In certain settings, MI using multinomial logistic regression models achieves comparable performance, depending on the missing data mechanism and the amount of missing data.
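
To make the chained-equations idea concrete, the following is a minimal sketch of the cycle that MICE-style methods share: fill every gap with a starting value, then repeatedly visit each variable, fit a conditional model on the rows where that variable was observed, and redraw its missing cells from the fitted model. It is not the authors' implementation. The conditional model here is a stand-in chosen for brevity (a smoothed frequency table of one variable given a single neighboring variable), not the multinomial logistic, proportional odds, tree, or forest models evaluated in the article. The function name `chained_imputation`, its parameters, and the toy data are all hypothetical. The sketch also does not draw model parameters from a posterior, so it understates uncertainty relative to a proper MI procedure.

```python
import random
from collections import Counter, defaultdict


def chained_imputation(data, levels, n_iter=10, seed=0):
    """Return one completed copy of `data` (rows of ordinal codes, None = missing).

    Assumes at least two columns and at least one observed value per column.
    """
    rng = random.Random(seed)
    n_vars = len(data[0])
    missing = [[value is None for value in row] for row in data]
    completed = [row[:] for row in data]  # never modify the caller's data

    # Starting values: fill each gap with a random observed value from its column.
    for j in range(n_vars):
        observed = [row[j] for row in data if row[j] is not None]
        for i, row in enumerate(completed):
            if missing[i][j]:
                row[j] = rng.choice(observed)

    for _ in range(n_iter):
        for j in range(n_vars):
            k = (j + 1) % n_vars  # assumption: one neighboring column as predictor

            # "Fit": count column j given column k on rows where j was observed.
            # Each cell starts at 1 (add-one smoothing) so no level has zero weight.
            table = defaultdict(lambda: Counter({level: 1 for level in levels}))
            for i, row in enumerate(completed):
                if not missing[i][j]:
                    table[row[k]][row[j]] += 1

            # "Impute": redraw every originally missing cell of column j.
            for i, row in enumerate(completed):
                if missing[i][j]:
                    counts = table[row[k]]
                    weights = [counts[level] for level in levels]
                    row[j] = rng.choices(levels, weights=weights)[0]

    return completed


# Hypothetical toy data: three ordinal variables coded 1 to 3.
toy = [
    [1, 2, 1],
    [2, None, 2],
    [None, 3, 3],
    [3, 3, None],
    [1, 1, 1],
    [2, 2, None],
]

# Multiple imputation: m completed datasets, one per seed.
imputed_sets = [chained_imputation(toy, levels=[1, 2, 3], seed=s) for s in range(5)]
```

Each element of `imputed_sets` would then be analyzed separately and the results pooled. Swapping the frequency table for a regression or tree model fitted on all other variables changes only the "fit" and "impute" steps; the surrounding cycle stays the same.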

