Metrics reloaded: Recommendations for image analysis validation

Lena Maier-Hein, Annika Reinke, Patrick Godau, Minu D. Tizabi, Florian Buettner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee, Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A. Riegler, Manuel Wiesenfarth, A. Emre Kavur, Carole H. Sudre, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, A. Tim Rädsch, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Arriel Benis, Matthew Blaschko, M. Jorge Cardoso, Veronika Cheplygina, Beth A. Cimini, Gary S. Collins, Keyvan Farahani, Luciana Ferrer, Adrian Galdran, Bram van Ginneken, Robert Haase, Daniel A. Hashimoto, Michael M. Hoffman, Merel Huisman, Pierre Jannin, Charles E. Kahn, Dagmar Kainmueller, Bernhard Kainz, Alexandros Karargyris, Alan Karthikesalingam, Hannes Kenngott, Florian Kofler, Annette Kopp-Schneider, Anna Kreshuk, Tahsin Kurc, Bennett A. Landman, Geert Litjens, Amin Madani, Klaus Maier-Hein, Anne L. Martel, Peter Mattson, Erik Meijering, Bjoern Menze, Karel G. M. Moons, Henning Müller, Brennan Nichyporuk, Felix Nickel, Jens Petersen, Nasir Rajpoot, Nicola Rieke, Julio Saez-Rodriguez, Clara I. Sánchez, Shravya Shetty, Maarten van Smeden, Ronald M. Summers, Abdel A. Taha, Aleksei Tiulpin, Sotirios A. Tsaftaris, Ben Van Calster, Gaël Varoquaux, Paul F. Jäger

Source: arXiv. Shared first authors: Lena Maier-Hein, Annika Reinke. arXiv admin note: substantial text overlap with arXiv:2104.05642

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.
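To make the idea of a problem fingerprint more concrete, the following is a minimal sketch of how such a structured representation might look in code. It is not the Metrics Reloaded tool or its API: the names `ProblemCategory`, `ProblemFingerprint`, `CLASSIFICATION_LEVELS` and all field names are hypothetical, chosen only to mirror the aspects listed in the abstract (domain interest, target structure, data set, algorithm output) and the four supported problem categories. The assignment of categories to classification levels, in particular treating instance segmentation as both object-level and pixel-level, is an assumption made for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum


class ProblemCategory(Enum):
    # The four problem categories named in the abstract.
    IMAGE_LEVEL_CLASSIFICATION = "image-level classification"
    OBJECT_DETECTION = "object detection"
    SEMANTIC_SEGMENTATION = "semantic segmentation"
    INSTANCE_SEGMENTATION = "instance segmentation"


# Assumed mapping from category to the level(s) at which the task
# can be read as classification (image, object or pixel).
CLASSIFICATION_LEVELS = {
    ProblemCategory.IMAGE_LEVEL_CLASSIFICATION: ("image",),
    ProblemCategory.OBJECT_DETECTION: ("object",),
    ProblemCategory.SEMANTIC_SEGMENTATION: ("pixel",),
    ProblemCategory.INSTANCE_SEGMENTATION: ("object", "pixel"),
}


@dataclass
class ProblemFingerprint:
    """Hypothetical container for the aspects relevant to metric selection."""

    category: ProblemCategory
    # Free-form property dictionaries; the keys are illustrative only.
    domain_interest: dict = field(default_factory=dict)
    target_structure: dict = field(default_factory=dict)
    data_set: dict = field(default_factory=dict)
    algorithm_output: dict = field(default_factory=dict)

    def classification_levels(self):
        """Return the level(s) at which this problem is a classification task."""
        return CLASSIFICATION_LEVELS[self.category]


# Example: an instance segmentation problem with a made-up property.
example = ProblemFingerprint(
    category=ProblemCategory.INSTANCE_SEGMENTATION,
    target_structure={"small_structures": True},
)
print(example.classification_levels())
```

A structure like this only records the problem description. Selecting the metrics themselves is what the framework's recommendations do, and that step is deliberately left out of the sketch.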
