Improving Data Cleaning Using Discrete Optimization

One of the most important processing steps in any analysis pipeline is handling missing data. Traditional approaches simply delete any sample or feature with missing elements. Recent imputation methods replace missing data based on assumed relationships between the observed data and the missing elements. However, there is a largely under-explored alternative between these extremes. Partial deletion approaches remove excessive amounts of missing data, as defined by the user. They can be used in place of traditional deletion or as a precursor to imputation. In this manuscript, we expand upon the Mr. Clean suite of algorithms, focusing on the scenario where all missing data is removed. We show that the RowCol Integer Program can be recast as a Linear Program, thereby reducing runtime. Additionally, the Element Integer Program can be reformulated to reduce the number of variables and allow for high levels of parallelization. Using real-world data sets from genetic, gene expression, and single-cell RNA-seq experiments, we demonstrate that our algorithms outperform existing deletion techniques across several levels of missingness, balancing runtime and data retention. Our combined greedy algorithm retains the maximum number of valid elements in 126 of 150 scenarios and stays within 1% of the maximum in 23 of the remaining 24. When removing all missing data, the reformulated Element IP complements the greedy algorithm: on larger data sets it runs faster and retains more valid elements than its generic counterpart. Together, these two programs greatly increase the amount of valid data retained relative to traditional deletion techniques and further improve on existing partial deletion algorithms.
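To make the problem setting concrete, the following is a minimal sketch of the task of removing rows and columns until no missing data remains, while keeping as many valid elements as possible. It is not the Mr. Clean greedy algorithm, the RowCol program, or the Element IP; the heuristic (drop whichever row or column currently has the highest fraction of missing entries) and the names `retained_elements` and `greedy_clean` are assumptions chosen purely for illustration, and the toy matrix is invented.

```python
import numpy as np

def retained_elements(missing, rows, cols):
    """Size of the kept submatrix, or 0 if it still contains a missing cell."""
    sub = missing[np.ix_(rows, cols)]
    return 0 if sub.any() else sub.size

def greedy_clean(missing):
    """Illustrative heuristic (an assumption, not the paper's algorithm):
    repeatedly drop the row or column with the highest missing fraction
    until the remaining submatrix has no missing entries."""
    rows = list(range(missing.shape[0]))
    cols = list(range(missing.shape[1]))
    while rows and cols:
        sub = missing[np.ix_(rows, cols)]
        if not sub.any():
            break
        row_frac = sub.mean(axis=1)  # missing fraction of each kept row
        col_frac = sub.mean(axis=0)  # missing fraction of each kept column
        if row_frac.max() >= col_frac.max():
            rows.pop(int(row_frac.argmax()))
        else:
            cols.pop(int(col_frac.argmax()))
    return rows, cols

# Toy missingness mask (True = missing); rows are samples, columns are features.
missing = np.array([
    [1, 1, 1, 0],
    [0, 0, 0, 1],
    [0, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
], dtype=bool)

# Traditional deletion of every sample that has any missing element.
complete_rows = [i for i in range(missing.shape[0]) if not missing[i].any()]
rows_only = retained_elements(missing, complete_rows, list(range(missing.shape[1])))

# Mixed row/column deletion with the illustrative greedy heuristic.
kept_rows, kept_cols = greedy_clean(missing)
greedy = retained_elements(missing, kept_rows, kept_cols)

print("row deletion keeps", rows_only, "elements; greedy keeps", greedy)
```

On this toy mask, deleting only incomplete samples keeps two rows, whereas dropping one row and one column keeps a larger complete block. This gap between pure row or column deletion and mixed deletion is what partial deletion methods aim to exploit.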
