Machine learning (ML) provides powerful tools for predictive modeling. ML's popularity stems from the promise of sample-level prediction, with applications across a variety of fields, from physics and marketing to healthcare. However, if not properly implemented and evaluated, ML pipelines may contain leakage, which typically results in overoptimistic performance estimates and failure to generalize to new data. This can have severe negative financial and societal implications. Our aim is to improve understanding of the causes of leakage when designing, implementing, and evaluating ML pipelines. Using concrete examples, we provide a comprehensive overview and discussion of the various types of leakage that may arise in ML pipelines.