Evaluation of Missing Data Analytical Techniques in Longitudinal Research: Traditional and Machine Learning Approaches

Missing Not at Random (MNAR) and nonnormal data are challenging to handle. Traditional missing data analytical techniques such as full information maximum likelihood estimation (FIML) may fail with nonnormal data as they are built on normal distribution assumptions. Two-Stage Robust Estimation (TSRE) does manage nonnormal data, but both FIML and TSRE are less explored in longitudinal studies under MNAR conditions with nonnormal distributions. Unlike traditional statistical approaches, machine learning approaches do not require distributional assumptions about the data. More importantly, they have shown promise for MNAR data; however, their application in longitudinal studies, addressing both Missing at Random (MAR) and MNAR scenarios, is also underexplored. This study utilizes Monte Carlo simulations to assess and compare the effectiveness of six analytical techniques for missing data within the growth curve modeling framework. These techniques include traditional approaches like FIML and TSRE, machine learning approaches by single imputation (K-Nearest Neighbors and missForest), and machine learning approaches by multiple imputation (micecart and miceForest). We investigate the influence of sample size, missing data rate, missing data mechanism, and data distribution on the accuracy and efficiency of model estimation. Our findings indicate that FIML is most effective for MNAR data among the tested approaches. TSRE excels in handling MAR data, while missForest is only advantageous in limited conditions with a combination of very skewed distributions, very large sample sizes (e.g., n larger than 1000), and low missing data rates.
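The difference between the MAR and MNAR mechanisms compared in the study can be illustrated with a small simulation. The following is a minimal sketch, not the authors' simulation design: the function names `simulate_growth` and `impose_missing`, the growth parameters, the number of waves, and the rule for deleting values are all assumptions chosen for illustration. Under MAR, the chance that a value is missing depends on the always-observed first wave; under MNAR, it depends on the unobserved value itself.

```python
import numpy as np


def simulate_growth(n, n_waves=4, seed=0):
    """Simulate linear growth curve data: one row per person, one column per wave.

    All parameter values here are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    time = np.arange(n_waves)
    intercept = rng.normal(5.0, 1.0, size=n)   # person-specific starting level
    slope = rng.normal(1.0, 0.5, size=n)       # person-specific rate of change
    noise = rng.normal(0.0, 0.5, size=(n, n_waves))
    return intercept[:, None] + slope[:, None] * time + noise


def impose_missing(y, rate, mechanism, seed=1):
    """Return a copy of y with roughly `rate` of each later wave set to NaN.

    mechanism="MAR":  missingness at wave t depends on the fully observed wave 0.
    mechanism="MNAR": missingness at wave t depends on the wave-t value itself.
    Wave 0 is always left complete.
    """
    if mechanism not in ("MAR", "MNAR"):
        raise ValueError("mechanism must be 'MAR' or 'MNAR'")
    rng = np.random.default_rng(seed)
    out = y.copy()
    n = y.shape[0]
    for t in range(1, y.shape[1]):
        driver = y[:, 0] if mechanism == "MAR" else y[:, t]
        z = (driver - driver.mean()) / driver.std()
        # Add noise so that missingness is probabilistic rather than a hard cutoff
        score = z + rng.normal(size=n)
        cutoff = np.quantile(score, 1.0 - rate)
        out[score > cutoff, t] = np.nan        # higher scores are more likely missing
    return out
```

For example, with `y = simulate_growth(1000)` and `mnar = impose_missing(y, 0.3, "MNAR")`, comparing `np.nanmean(mnar[:, -1])` with `y[:, -1].mean()` shows how the observed mean of the last wave drifts below the complete-data mean when high values are preferentially deleted. In a study of the kind described above, such incomplete data sets would then be passed to each missing data technique before fitting the growth curve model.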
