Metrics reloaded: Recommendations for image analysis validation

Lena Maier-Hein, Annika Reinke, Patrick Godau, Minu D. Tizabi, Florian Büttner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee, Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A. Riegler, Manuel Wiesenfarth, A. Emre Kavur, Carole H. Sudre, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, A. Tim Rädsch, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Arriel Benis, Matthew Blaschko, M. Jorge Cardoso, Veronika Cheplygina, Beth A. Cimini, Gary S. Collins, Keyvan Farahani, Luciana Ferrer, Adrian Galdran, Bram van Ginneken, Robert Haase, Daniel A. Hashimoto, Michael M. Hoffman, Merel Huisman, Pierre Jannin, Charles E. Kahn, Dagmar Kainmueller, Bernhard Kainz, Alexandros Karargyris, Alan Karthikesalingam, Hannes Kenngott, Florian Kofler, Annette Kopp-Schneider, Anna Kreshuk, Tahsin Kurc, Bennett A. Landman, Geert Litjens, Amin Madani, Klaus Maier-Hein, Anne L. Martel, Peter Mattson, Erik Meijering, Bjoern Menze, Karel G. M. Moons, Henning Müller, Brennan Nichyporuk, Felix Nickel, Jens Petersen, Nasir Rajpoot, Nicola Rieke, Julio Saez-Rodriguez, Clara I. Sánchez, Shravya Shetty, Maarten van Smeden, Ronald M. Summers, Abdel A. Taha, Aleksei Tiulpin, Sotirios A. Tsaftaris, Ben Van Calster, Gaël Varoquaux, Paul F. Jäger

Source: arXiv. Shared first authors: Lena Maier-Hein, Annika Reinke.

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint - a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.
