Metrics reloaded: Pitfalls and recommendations for image analysis validation

Lena Maier-Hein, Annika Reinke, Patrick Godau, Minu D. Tizabi, Florian Büttner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee, Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A. Riegler, Manuel Wiesenfarth, Emre Kavur, Carole H. Sudre, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, A. Tim Rädsch, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Arriel Benis, Matthew Blaschko, M. Jorge Cardoso, Veronika Cheplygina, Beth A. Cimini, Gary S. Collins, Keyvan Farahani, Luciana Ferrer, Adrian Galdran, Bram van Ginneken, Robert Haase, Daniel A. Hashimoto, Michael M. Hoffman, Merel Huisman, Pierre Jannin, Charles E. Kahn, Dagmar Kainmueller, Bernhard Kainz, Alexandros Karargyris, Alan Karthikesalingam, Hannes Kenngott, Florian Kofler, Annette Kopp-Schneider, Anna Kreshuk, Tahsin Kurc, Bennett A. Landman, Geert Litjens, Amin Madani, Klaus Maier-Hein, Anne L. Martel, Peter Mattson, Erik Meijering, Bjoern Menze, Karel G. M. Moons, Henning Müller, Brennan Nichyporuk, Felix Nickel, Jens Petersen, Nasir Rajpoot, Nicola Rieke, Julio Saez-Rodriguez, Clara I. Sánchez, Shravya Shetty, Maarten van Smeden, Ronald M. Summers, Abdel A. Taha, Aleksei Tiulpin, Sotirios A. Tsaftaris, Ben Van Calster, Gaël Varoquaux, Paul F. Jäger

Source: arXiv. Shared first authors: Lena Maier-Hein, Annika Reinke.

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.
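As a hedged illustration of how a chosen metric can fail to reflect the domain interest, the following minimal Python sketch compares pixel-level accuracy with the Dice similarity coefficient on a small target structure. The image size, the structure size, and the helper names `accuracy` and `dice` are assumptions made for this example only; they are not taken from the Metrics Reloaded framework or its online tool.

```python
def accuracy(pred, ref):
    """Fraction of pixels whose predicted label equals the reference label."""
    matches = sum(p == r for p, r in zip(pred, ref))
    return matches / len(ref)


def dice(pred, ref):
    """Dice similarity coefficient for flat binary masks.

    Returns 1.0 when both masks are empty (a common convention, assumed here).
    """
    overlap = sum(1 for p, r in zip(pred, ref) if p and r)
    denom = sum(pred) + sum(ref)
    return 1.0 if denom == 0 else 2.0 * overlap / denom


# Hypothetical 100x100 image containing a 4x4 target structure
# (16 foreground pixels out of 10,000), stored as a flat list.
width = height = 100
ref = [
    1 if (10 <= y < 14 and 10 <= x < 14) else 0
    for y in range(height)
    for x in range(width)
]

# A prediction that misses the structure entirely.
empty_pred = [0] * (width * height)

acc = accuracy(empty_pred, ref)
dsc = dice(empty_pred, ref)
print(f"accuracy = {acc:.4f}")  # accuracy = 0.9984
print(f"Dice     = {dsc:.4f}")  # Dice     = 0.0000
```

In this toy setting, a prediction that finds nothing still scores 99.84% accuracy because background pixels dominate, while the Dice score of 0 reflects that the structure of interest was missed. Which metric is appropriate depends on the problem at hand, which is the kind of decision the problem fingerprint is meant to guide.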
