Data that are missing not at random (MNAR) and data that are nonnormal are both difficult to handle. Traditional missing data techniques such as full information maximum likelihood (FIML) estimation assume normality and may therefore fail with nonnormal data. Two-stage robust estimation (TSRE) accommodates nonnormal data, but neither FIML nor TSRE has been studied much in longitudinal settings where data are both MNAR and nonnormal. Machine learning approaches, unlike traditional statistical approaches, require no distributional assumptions about the data and have shown promise for MNAR data; however, their use in longitudinal studies under missing at random (MAR) and MNAR mechanisms is also underexplored. This study uses Monte Carlo simulations to compare the effectiveness of six missing data techniques within the growth curve modeling framework: two traditional approaches (FIML and TSRE), two machine learning approaches based on single imputation (K-Nearest Neighbors and missForest), and two machine learning approaches based on multiple imputation (micecart and miceForest). We examine how sample size, missing data rate, missing data mechanism, and data distribution affect the accuracy and efficiency of model estimation. Among the tested approaches, FIML is the most effective for MNAR data and TSRE performs best for MAR data, while missForest is advantageous only under a narrow combination of conditions: very skewed distributions, very large sample sizes (e.g., n larger than 1000), and low missing data rates.
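To make the distinction between the MAR and MNAR mechanisms concrete, the following is a minimal sketch of how one replication of such a simulation might generate data. It is not the study's actual design: the function names `simulate_growth` and `impose_missing`, the four measurement waves, the population parameters, the chi-square error distribution used for the skewed case, and the rule that ties missingness to either the first wave (MAR) or the current value itself (MNAR) are all assumptions chosen for illustration.

```python
import numpy as np

def simulate_growth(n, n_waves=4, skew=False, seed=0):
    """Generate linear growth curve data (hypothetical parameter values)."""
    rng = np.random.default_rng(seed)
    t = np.arange(n_waves)                      # time scores 0, 1, 2, ...
    intercept = rng.normal(5.0, 1.0, size=n)    # random intercepts
    slope = rng.normal(1.0, 0.5, size=n)        # random slopes
    if skew:
        # Assumed nonnormal case: chi-square(1) errors, centered and
        # scaled to mean 0 and variance 1.
        err = (rng.chisquare(1, size=(n, n_waves)) - 1.0) / np.sqrt(2.0)
    else:
        err = rng.normal(0.0, 1.0, size=(n, n_waves))
    return intercept[:, None] + slope[:, None] * t + err

def impose_missing(y, rate, mechanism, seed=0):
    """Delete a fraction `rate` of each later wave under MAR or MNAR."""
    rng = np.random.default_rng(seed)
    y_obs = y.copy()
    for j in range(1, y.shape[1]):              # wave 0 stays fully observed
        # MAR: missingness depends on the always-observed first wave.
        # MNAR: missingness depends on the value that goes missing.
        driver = y[:, 0] if mechanism == "MAR" else y[:, j]
        z = (driver - driver.mean()) / driver.std()
        score = z + rng.normal(0.0, 1.0, size=y.shape[0])
        cutoff = np.quantile(score, 1.0 - rate)
        y_obs[score > cutoff, j] = np.nan       # highest scores go missing
    return y_obs

y = simulate_growth(n=500, skew=True, seed=1)
y_mar = impose_missing(y, rate=0.3, mechanism="MAR", seed=2)
y_mnar = impose_missing(y, rate=0.3, mechanism="MNAR", seed=2)
```

In a full study, each incomplete data set would then be passed to the competing techniques (for example, FIML estimation or an imputation method followed by growth curve estimation), and bias and efficiency would be summarized over many replications. That step is omitted here because it depends on software choices the abstract does not specify.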