Missing data pose significant challenges in predictive analysis and often lead to biased conclusions when oversimplified assumptions are made about the missing data process. When the data are missing not at random (MNAR), jointly modeling the data and the missing data indicators is essential. Motivated by a real data application with partially missing multivariate outcomes related to leaf photosynthetic traits and several environmental covariates, we propose two methods under a selection model framework for handling missingness in the response variables, both suitable for a range of missingness mechanisms. Both approaches use a multivariate extension of Bayesian additive regression trees (BART) to flexibly model the outcomes. The first approach pairs this outcome model with a probit regression model for the missingness indicators. For scenarios where the relationship between the missingness and the data is more complex or non-linear, the second approach instead uses a probit BART model to characterize the missing data process, so that two BART models are employed simultaneously. Both models also handle ignorable covariate missingness. We demonstrate the performance of both models relative to existing missing data approaches through extensive simulations, in both univariate and multivariate settings, and through the application to the leaf photosynthetic trait data.
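To illustrate why MNAR missingness calls for a joint model, the following is a minimal simulation sketch, not the authors' implementation. It assumes a single covariate, a univariate outcome, and a probit missingness model written in its latent-variable form; the function name `simulate_mnar` and the parameters `alpha0` and `alpha1` are hypothetical and chosen only for illustration.

```python
import numpy as np

def simulate_mnar(n=20000, alpha0=-0.5, alpha1=1.5, seed=0):
    """Simulate a univariate outcome with probit MNAR missingness.

    All names and parameter values are illustrative assumptions.
    """
    rng = np.random.default_rng(seed)
    x = rng.uniform(-1.0, 1.0, size=n)  # a single covariate
    # Outcome model p(y | x): a non-linear mean plus Gaussian noise.
    y = np.sin(np.pi * x) + rng.normal(0.0, 0.5, size=n)
    # Missingness model p(m | y, x), probit in latent-variable form:
    # m = 1 (missing) when alpha0 + alpha1 * y + N(0, 1) noise exceeds 0.
    z = alpha0 + alpha1 * y + rng.normal(0.0, 1.0, size=n)
    m = (z > 0).astype(int)
    return x, y, m

x, y, m = simulate_mnar()
full_mean = y.mean()                # uses values that would be unobserved in practice
observed_mean = y[m == 0].mean()    # complete-case estimate
print(f"missing fraction: {m.mean():.2f}")
print(f"mean of all outcomes: {full_mean:.3f}")
print(f"mean of observed outcomes only: {observed_mean:.3f}")
```

Because `alpha1` is positive in this sketch, larger outcomes are more likely to be missing, so the complete-case mean falls below the mean of all outcomes. An analysis that ignores the missingness model would inherit this bias.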