Missing data is a prevalent issue in many applications, including large medical registries such as the Swedish Healthcare Quality Registries, and can lead to biased or inefficient analyses if not handled properly. Multiple Imputation by Chained Equations (MICE) is a popular and versatile method for handling multivariate missing data, but traditional implementations face significant challenges when applied to big data sets because of computational time and memory limitations. To address this, we developed the bigMICE package, which adapts the MICE framework to big data using Apache Spark MLlib and Spark ML. Our implementation allows the maximum memory usage to be controlled during execution, enabling very large data sets to be processed on hardware with limited memory, such as ordinary laptops. The package was tested on a large Swedish medical registry to measure memory usage, runtime, and the dependence of imputation quality on sample size and on the proportion of missing data. We find that our method is generally more memory efficient and faster on large data sets than a commonly used MICE implementation. We also demonstrate that working with very large data sets can yield high-quality imputations even when a variable has a large proportion of missing data. This paper also provides guidelines and recommendations on how to install and use our open-source package.
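For readers unfamiliar with the chained-equations idea mentioned in the abstract, the following is a minimal in-memory sketch in Python with NumPy. It is not the bigMICE implementation and does not use Spark; the function name `mice_sketch` and its parameters are hypothetical and chosen for illustration only. It also makes simplifying assumptions: all variables are numeric, each conditional model is an ordinary least-squares regression, and imputations are drawn by adding Gaussian noise to the predictions rather than by also drawing the regression parameters, as a full MICE procedure would.

```python
import numpy as np


def mice_sketch(X, n_imputations=3, n_iterations=5, seed=0):
    """Return a list of completed copies of X (hypothetical helper, numeric data only)."""
    rng = np.random.default_rng(seed)
    X = np.asarray(X, dtype=float)
    missing = np.isnan(X)                      # mask of originally missing cells
    col_means = np.nanmean(X, axis=0)          # starting values for missing cells
    completed = []
    for _imp in range(n_imputations):
        filled = np.where(missing, col_means, X)
        for _it in range(n_iterations):
            # Cycle through the variables, regressing each on all the others.
            for j in range(X.shape[1]):
                miss_j = missing[:, j]
                if not miss_j.any():
                    continue
                obs_j = ~miss_j
                others = np.delete(filled, j, axis=1)
                design = np.column_stack([np.ones(len(filled)), others])
                # Fit on rows where variable j was observed.
                coef, *_ = np.linalg.lstsq(design[obs_j], filled[obs_j, j], rcond=None)
                resid = filled[obs_j, j] - design[obs_j] @ coef
                dof = max(int(obs_j.sum()) - design.shape[1], 1)
                sigma = np.sqrt(resid @ resid / dof)
                # Replace missing cells with prediction plus residual noise.
                noise = rng.normal(0.0, sigma, int(miss_j.sum()))
                filled[miss_j, j] = design[miss_j] @ coef + noise
        completed.append(filled)
    return completed
```

Each returned array is one completed data set; in a multiple-imputation analysis the quantity of interest would be estimated on each of them and the estimates then pooled. The sketch holds the whole matrix in memory, which is exactly the limitation that a Spark-based approach is meant to avoid.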