Missing values pose a persistent challenge in modern data science. Consequently, an ever-growing number of publications introduce new imputation methods in various fields. While many studies compare imputation approaches, they often focus on a limited subset of algorithms and evaluate performance primarily through pointwise metrics such as RMSE, which are not suited to measuring how well the true data distribution is preserved. In this work, we provide a systematic benchmarking method based on the idea of treating imputation as a distributional prediction task. We consider a large number of algorithms and, for the first time, evaluate them not only under synthetic missingness mechanisms but also in real-world missingness scenarios, using the concept of Imputation Scores. Finally, while previous benchmarks have often focused on numerical data, we also consider mixed data sets in our study. The analysis overwhelmingly confirms the superiority of iterative imputation algorithms, especially the methods implemented in the mice R package.
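A minimal, self-contained sketch (not part of the paper) illustrating the claim that pointwise metrics such as RMSE do not reward distributional preservation: under MCAR missingness on Gaussian data, imputing the observed mean minimizes RMSE yet visibly shrinks the variance, whereas drawing imputations from the observed empirical distribution scores worse on RMSE while preserving the variance. All variable names here are hypothetical choices for the demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(0.0, 1.0, size=n)      # fully observed "ground truth"
mask = rng.random(n) < 0.3            # 30% of entries missing completely at random

# Imputation A: observed-mean imputation (near-optimal for RMSE under MCAR)
mean_imp = np.where(mask, x[~mask].mean(), x)
# Imputation B: random draws from the observed empirical distribution
draw_imp = np.where(mask, rng.choice(x[~mask], size=n), x)

def rmse(imputed):
    """RMSE on the masked entries against the ground truth."""
    return np.sqrt(np.mean((imputed[mask] - x[mask]) ** 2))

# Mean imputation wins on RMSE but deflates the variance;
# distributional draws lose on RMSE but keep the variance near the truth.
print(f"RMSE     mean-imp: {rmse(mean_imp):.3f}   draw-imp: {rmse(draw_imp):.3f}")
print(f"Variance mean-imp: {mean_imp.var():.3f}   draw-imp: {draw_imp.var():.3f}   true: {x.var():.3f}")
```

This tension is exactly why the abstract argues for treating imputation as a distributional prediction task rather than ranking methods by RMSE alone.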