Missing data is a widespread problem in many domains, creating challenges in data analysis and decision making. Traditional techniques for dealing with missing data, such as excluding incomplete records or imputing simple estimates (e.g., the mean), are computationally efficient but may introduce bias and distort relationships between variables, leading to inaccurate analyses. Model-based imputation techniques offer a more robust solution that preserves the variability and relationships in the data, but they demand significantly more computation time, which in practice restricts them to small datasets. This work enables efficient, high-quality, and scalable data imputation within a database system, building on the widely used MICE (multiple imputation by chained equations) method. We adapt this method to exploit computation sharing and a ring abstraction for faster model training. To impute both continuous and categorical values, we develop techniques for in-database learning of stochastic linear regression and Gaussian discriminant analysis models. Our MICE implementations in PostgreSQL and DuckDB outperform alternative MICE implementations and model-based imputation techniques by up to two orders of magnitude in computation time, while maintaining high imputation quality.
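To make the chained-equation loop and the idea of computation sharing concrete, here is a minimal NumPy sketch. It is not the in-database implementation the abstract describes. The function name `mice_impute` and its parameters are hypothetical, it handles continuous columns only (no Gaussian discriminant analysis for categorical values), and it simplifies MICE by adding residual noise to each prediction without also drawing the regression parameters. The sharing shown here is one plausible reading of the idea: the matrix of sums of products is computed once, and each per-column regression is trained from it after subtracting the rows currently being imputed.

```python
import numpy as np


def mice_impute(X, n_iter=10, seed=0):
    """Illustrative MICE loop for continuous columns (NaN marks a missing cell)."""
    rng = np.random.default_rng(seed)
    X = np.array(X, dtype=float)  # work on a copy of the input
    missing = np.isnan(X)
    n, d = X.shape

    # Initial fill: column means of the observed values.
    col_means = np.nanmean(X, axis=0)
    X[missing] = np.take(col_means, np.nonzero(missing)[1])

    # Z = [1 | X]; G = Z^T Z holds every sum of products that any of the
    # per-column regressions needs, so it is computed once and then patched.
    Z = np.hstack([np.ones((n, 1)), X])
    G = Z.T @ Z

    for _ in range(n_iter):
        for j in range(d):
            rows = missing[:, j]
            if not rows.any():
                continue
            t = j + 1  # index of the target column inside Z
            feats = [k for k in range(d + 1) if k != t]

            # Subtract the rows being imputed: what remains are the aggregates
            # over rows where column j is observed (the training set).
            Zm = Z[rows]
            G_obs = G - Zm.T @ Zm

            # Solve the normal equations using sub-blocks of the aggregates.
            A = G_obs[np.ix_(feats, feats)]
            b = G_obs[feats, t]
            theta = np.linalg.lstsq(A, b, rcond=None)[0]

            # Residual variance from the aggregates alone: RSS = y'y - theta'b.
            n_obs = n - rows.sum()
            rss = max(G_obs[t, t] - theta @ b, 0.0)
            sigma = np.sqrt(rss / max(n_obs - len(feats), 1))

            # Stochastic regression imputation: prediction plus Gaussian noise.
            pred = Zm[:, feats] @ theta
            Z[rows, t] = pred + rng.normal(0.0, sigma, size=pred.shape)

            # Add the re-imputed rows back into the shared aggregates.
            Zm_new = Z[rows]
            G = G_obs + Zm_new.T @ Zm_new

    return Z[:, 1:]
```

Calling `mice_impute(data)` on a float array with `np.nan` in the missing cells returns a completed copy and leaves the observed cells unchanged. In this sketch each model is trained from sub-blocks of `G`, so the cost of a regression step depends on the number of rows being re-imputed and the number of columns, not on a fresh pass over the full dataset.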