Data leakage remains a recurrent source of optimistic bias in biomedical machine learning studies. Standard row-wise cross-validation and globally estimated preprocessing steps are often inappropriate for data with repeated measurements, study-level heterogeneity, batch effects, or temporal dependencies. This paper describes bioLeak, an R package for constructing leakage-aware resampling workflows and for auditing fitted models for common leakage mechanisms. The package provides leakage-aware split construction, train-fold-only preprocessing, cross-validated model fitting, nested hyperparameter tuning, post hoc leakage audits, and HTML reporting. The implementation supports binary classification, multiclass classification, regression, and survival analysis, with task-specific metrics and S4 containers for splits, fits, audits, and inflation summaries. The simulation artifacts show how apparent performance changes under controlled leakage mechanisms, and the case study illustrates how guarded and leaky pipelines can yield materially different conclusions on multi-study transcriptomic data. The emphasis throughout is on software design, reproducible workflows, and interpretation of diagnostic output.
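As an illustration of two ideas named above, leakage-aware split construction and train-fold-only preprocessing, the following is a minimal sketch in base R. It does not use or reproduce the bioLeak API; the toy data and all object names (`dat`, `subject_fold`, `scaled_test`) are hypothetical and chosen only for this example. The sketch assumes repeated measurements nested within subjects, assigns folds at the subject level, and estimates centring and scaling parameters from the training rows of each fold only.

```r
# Hypothetical toy data: 12 subjects with 3 repeated measurements each.
set.seed(1)
n_subjects <- 12
dat <- data.frame(
  subject = rep(seq_len(n_subjects), each = 3),
  x = rnorm(n_subjects * 3)
)

# Assign folds at the subject level, so repeated measurements from one
# subject never appear on both sides of a split.
k <- 4
subject_fold <- sample(rep(seq_len(k), length.out = n_subjects))
dat$fold <- subject_fold[dat$subject]

# For each fold, estimate centring and scaling from the training rows only,
# then apply those training-derived parameters to the held-out rows.
scaled_test <- lapply(seq_len(k), function(f) {
  train <- dat[dat$fold != f, ]
  test <- dat[dat$fold == f, ]
  mu <- mean(train$x)
  sigma <- sd(train$x)
  test$x_scaled <- (test$x - mu) / sigma
  # No subject may be shared between the training and held-out rows.
  stopifnot(length(intersect(train$subject, test$subject)) == 0)
  test
})
```

A leaky counterpart would call `sample()` on row indices instead of subject identifiers and would compute `mu` and `sigma` once on the full data set. Under the assumptions of this sketch, those are the two points at which information from held-out rows could reach the training step.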