We present DeepMVI, a deep learning method for missing value imputation in multidimensional time-series datasets. Missing values are commonplace in decision support platforms that aggregate data over long time stretches from disparate sources, and reliable data analytics calls for careful handling of missing data. One strategy is to impute the missing values, and a wide variety of algorithms exist, spanning simple interpolation, matrix factorization methods like SVD, statistical models like Kalman filters, and recent deep learning methods. We show that these methods often yield worse results on aggregate analytics than simply excluding the missing data. DeepMVI uses a neural network to combine fine-grained and coarse-grained patterns along a time series with trends from related series across categorical dimensions. After failing with off-the-shelf neural architectures, we design our own network that includes a temporal transformer with a novel convolutional window feature, and kernel regression with learned embeddings. The parameters and their training are designed carefully to generalize across different placements of missing blocks and different data characteristics. Experiments across nine real datasets and four missing-data scenarios, comparing against seven existing methods, show that DeepMVI is significantly more accurate, reducing error by more than 50% in more than half the cases compared to the best existing method. Although DeepMVI is slower than simpler matrix factorization methods, we justify the increased time overhead by showing that it is the only option that provides overall more accurate analytics than dropping missing values.
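To make the idea of "kernel regression with learned embeddings" concrete, the following is a minimal sketch and not DeepMVI's actual implementation, whose details the abstract does not give. It assumes a Nadaraya-Watson estimator with an RBF kernel over per-series embedding vectors: a missing value in one series is filled with a similarity-weighted average of the values observed in the other series at the same time step. The function name `kernel_regression_impute`, the `gamma` bandwidth, and the fixed (rather than learned) embeddings are all hypothetical choices made for illustration.

```python
import numpy as np


def kernel_regression_impute(values, observed, embeddings, gamma=1.0):
    """Fill missing entries with a kernel-weighted average of sibling series.

    Illustrative sketch only (assumed Nadaraya-Watson form, RBF kernel).

    values:     (n_series, n_steps) floats; entries where `observed` is
                False are ignored and may be NaN
    observed:   (n_series, n_steps) boolean mask, True where a value exists
    embeddings: (n_series, d) floats; fixed here, learned in a real model
    gamma:      RBF bandwidth (hypothetical hyperparameter)
    """
    # Pairwise squared distances between series embeddings.
    diff = embeddings[:, None, :] - embeddings[None, :, :]
    sq_dist = (diff ** 2).sum(axis=-1)

    # RBF similarity between every pair of series.
    weights = np.exp(-gamma * sq_dist)
    np.fill_diagonal(weights, 0.0)  # a series never contributes to itself

    # Zero out missing entries so they add nothing to the weighted sum.
    filled = np.where(observed, values, 0.0)
    numer = weights @ filled                  # weighted sum of observed siblings
    denom = weights @ observed.astype(float)  # total weight of observed siblings

    # Where no sibling is observed at a time step, the estimate stays NaN.
    estimate = np.full(values.shape, np.nan)
    np.divide(numer, denom, out=estimate, where=denom > 0)

    # Keep observed values untouched; use the estimate only for gaps.
    return np.where(observed, values, estimate)
```

For example, if two series have nearly identical embeddings and a third lies far away, a gap in the first series is filled almost entirely from the second. In a trained model the embeddings would be optimized so that series which help predict each other end up close together.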