While clinical trials are the state-of-the-art method for assessing the effect of new medication in a comparative manner, benchmarking in medical image analysis is performed through so-called challenges. A recent comprehensive analysis of multiple biomedical image analysis challenges revealed a large discrepancy between the impact of challenges and the quality control of their design and reporting standards. This work follows up on these results and addresses the specific question of the reproducibility of the participants' methods. To determine whether alternative interpretations of the method descriptions may change the challenge ranking, we reproduced the algorithms submitted to the 2019 Robust Medical Image Segmentation Challenge (ROBUST-MIS). The leaderboard differed substantially between the original challenge and the reimplementation, indicating that challenge rankings may not be sufficiently reproducible.