Validating reported experimental results is one way to address the reproducibility crisis in artificial intelligence, but it is a challenging task: it requires either reimplementing the techniques or meticulously assessing papers for deviations from the scientific method and best statistical practices. To facilitate this validation, we have developed numerical techniques that identify inconsistencies between reported performance scores and the corresponding experimental setups in machine learning problems, including binary and multiclass classification and regression. These consistency tests are integrated into the open-source package mlscorecheck, which also provides test bundles designed to detect systematically recurring flaws in specific fields, such as retinal image processing and synthetic minority oversampling.
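To illustrate what a consistency test of this kind can look like, the following is a minimal sketch for the simplest case: binary classification on a single test set with known class counts. It is not the mlscorecheck API or necessarily the algorithm the package uses. The function names, the choice of scores (accuracy, sensitivity, specificity), and the assumption that scores were rounded to a fixed number of decimals are all illustrative. The idea is to enumerate every possible confusion matrix and check whether any of them yields the reported scores within rounding error.

```python
def consistent_confusion_matrices(p, n, reported, decimals):
    """Return all (tp, tn) pairs whose scores match the reported values.

    p, n      -- number of positive and negative test samples
    reported  -- dict with any of the keys "acc", "sens", "spec"
    decimals  -- decimal places the scores are assumed to be rounded to
    """
    # Assumption: scores were rounded, so the true value lies within
    # half a unit of the last reported digit (plus a tiny float slack).
    eps = 0.5 * 10 ** (-decimals) + 1e-12

    score_functions = {
        "acc": lambda tp, tn: (tp + tn) / (p + n),
        "sens": lambda tp, tn: tp / p,
        "spec": lambda tp, tn: tn / n,
    }

    matches = []
    # Exhaustive search over every feasible confusion matrix.
    for tp in range(p + 1):
        for tn in range(n + 1):
            if all(
                abs(score_functions[name](tp, tn) - value) <= eps
                for name, value in reported.items()
            ):
                matches.append((tp, tn))
    return matches


def is_consistent(p, n, reported, decimals):
    """True if at least one confusion matrix reproduces the reported scores."""
    return len(consistent_confusion_matrices(p, n, reported, decimals)) > 0


if __name__ == "__main__":
    # Hypothetical test set: 100 positives, 200 negatives.
    plausible = {"acc": 0.85, "sens": 0.85, "spec": 0.85}
    implausible = {"acc": 0.95, "sens": 0.80, "spec": 0.80}

    print(is_consistent(100, 200, plausible, decimals=2))    # True
    print(is_consistent(100, 200, implausible, decimals=2))  # False
```

The second set of scores fails because accuracy is a weighted average of sensitivity and specificity, so it cannot exceed both. Exhaustive enumeration is only practical for small test sets and a single evaluation; setups involving cross-validation or aggregated scores need more elaborate numerical treatment.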