Time series imputation is a critical challenge in data mining, particularly in domains such as healthcare and environmental monitoring, where missing data can compromise analytical outcomes. This study investigates how masking strategies, normalization timing, and missingness patterns influence the performance of eleven state-of-the-art imputation models across three datasets. Specifically, we evaluate the effects of pre-masking versus in-mini-batch masking, augmentation versus overlaying of artificial missingness, and pre-normalization versus post-normalization. Our findings show that the masking strategy strongly affects imputation accuracy: dynamic masking provides robust augmentation benefits, while overlay masking better simulates real-world missingness patterns. Sophisticated models such as CSDI were sensitive to preprocessing configurations, whereas simpler models such as BRITS delivered consistent and efficient performance. We highlight the importance of aligning preprocessing pipelines and masking strategies with dataset characteristics to improve robustness under varied conditions, including high missing rates. This study provides actionable insights for designing imputation pipelines and underscores the need for transparent and comprehensive experimental designs.
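The three preprocessing choices named above can be illustrated with a small NumPy sketch. It is only one plausible reading of the terms, not the study's implementation. It assumes that augmentation hides a fraction of the observed entries only, that overlaying draws a mask over all entries so that some draws land on values that are already missing, and that in-mini-batch masking redraws the mask on every pass while pre-masking draws it once. The function names, the missing-completely-at-random draw, and the single global mean and standard deviation are all illustrative assumptions.

```python
import numpy as np


def augment_mask(observed, rate, rng):
    """Assumed 'augmentation': hide a fraction `rate` of the observed entries only."""
    idx = np.flatnonzero(observed)
    n_hide = int(round(rate * idx.size))
    chosen = rng.choice(idx, size=n_hide, replace=False)
    hidden = np.zeros(observed.size, dtype=bool)
    hidden[chosen] = True
    return hidden.reshape(observed.shape)


def overlay_mask(observed, rate, rng):
    """Assumed 'overlay': draw over every entry; draws on missing entries are wasted."""
    drawn = rng.random(observed.shape) < rate
    return drawn & observed


def standardize(x, visible):
    """Z-score `x` using statistics from the `visible` entries only (global, for brevity)."""
    values = x[visible]
    return (x - values.mean()) / values.std()


rng = np.random.default_rng(0)
x = rng.normal(size=(6, 4))
observed = rng.random(x.shape) > 0.2      # True where a real value exists
x = np.where(observed, x, np.nan)         # originally missing entries are NaN

# Pre-masking: one fixed artificial mask, reused for every epoch.
fixed_hidden = augment_mask(observed, 0.3, rng)

# In-mini-batch (dynamic) masking: a fresh mask each time a batch is served.
dynamic_hidden = [augment_mask(observed, 0.3, rng) for _epoch in range(3)]

# Pre-normalization: statistics are computed before artificial masking.
x_pre = standardize(x, observed)
# Post-normalization: statistics exclude the artificially hidden entries.
x_post = standardize(x, observed & ~fixed_hidden)
```

Under these assumptions, the augmentation mask hides exactly the requested share of observed values, whereas the overlay mask hides somewhat fewer whenever its draws coincide with entries that were already missing.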