Metrics reloaded: Recommendations for image analysis validation

Lena Maier-Hein, Annika Reinke, Patrick Godau, Minu D. Tizabi, Florian Buettner, Evangelia Christodoulou, Ben Glocker, Fabian Isensee, Jens Kleesiek, Michal Kozubek, Mauricio Reyes, Michael A. Riegler, Manuel Wiesenfarth, A. Emre Kavur, Carole H. Sudre, Michael Baumgartner, Matthias Eisenmann, Doreen Heckmann-Nötzel, Tim Rädsch, Laura Acion, Michela Antonelli, Tal Arbel, Spyridon Bakas, Arriel Benis, Matthew Blaschko, M. Jorge Cardoso, Veronika Cheplygina, Beth A. Cimini, Gary S. Collins, Keyvan Farahani, Luciana Ferrer, Adrian Galdran, Bram van Ginneken, Robert Haase, Daniel A. Hashimoto, Michael M. Hoffman, Merel Huisman, Pierre Jannin, Charles E. Kahn, Dagmar Kainmueller, Bernhard Kainz, Alexandros Karargyris, Alan Karthikesalingam, Hannes Kenngott, Florian Kofler, Annette Kopp-Schneider, Anna Kreshuk, Tahsin Kurc, Bennett A. Landman, Geert Litjens, Amin Madani, Klaus Maier-Hein, Anne L. Martel, Peter Mattson, Erik Meijering, Bjoern Menze, Karel G. M. Moons, Henning Müller, Brennan Nichyporuk, Felix Nickel, Jens Petersen, Nasir Rajpoot, Nicola Rieke, Julio Saez-Rodriguez, Clara I. Sánchez, Shravya Shetty, Maarten van Smeden, Ronald M. Summers, Abdel A. Taha, Aleksei Tiulpin, Sotirios A. Tsaftaris, Ben Van Calster, Gaël Varoquaux, Paul F. Jäger

From arXiv. Shared first authors: Lena Maier-Hein, Annika Reinke. arXiv admin note: substantial text overlap with arXiv:2104.05642. Published in Nature Methods.

Increasing evidence shows that flaws in machine learning (ML) algorithm validation are an underestimated global problem. Particularly in automatic biomedical image analysis, chosen performance metrics often do not reflect the domain interest, thus failing to adequately measure scientific progress and hindering translation of ML techniques into practice. To overcome this, our large international expert consortium created Metrics Reloaded, a comprehensive framework guiding researchers in the problem-aware selection of metrics. Following the convergence of ML methodology across application domains, Metrics Reloaded fosters the convergence of validation methodology. The framework was developed in a multi-stage Delphi process and is based on the novel concept of a problem fingerprint: a structured representation of the given problem that captures all aspects that are relevant for metric selection, from the domain interest to the properties of the target structure(s), data set and algorithm output. Based on the problem fingerprint, users are guided through the process of choosing and applying appropriate validation metrics while being made aware of potential pitfalls. Metrics Reloaded targets image analysis problems that can be interpreted as a classification task at image, object or pixel level, namely image-level classification, object detection, semantic segmentation, and instance segmentation tasks. To improve the user experience, we implemented the framework in the Metrics Reloaded online tool, which also provides a point of access to explore weaknesses, strengths and specific recommendations for the most common validation metrics. The broad applicability of our framework across domains is demonstrated by an instantiation for various biological and medical image analysis use cases.
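
As a purely illustrative sketch, and not the schema of the Metrics Reloaded online tool, a problem fingerprint could be held as a small structured record. Only the four problem categories and the four groups of properties are taken from the abstract above; the class names, the field names, the example keys and the `missing_groups` helper are hypothetical and exist only to make the idea concrete.

```python
from dataclasses import dataclass, field
from enum import Enum


class ProblemCategory(Enum):
    """The four problem categories named in the abstract."""
    IMAGE_LEVEL_CLASSIFICATION = "image-level classification"
    OBJECT_DETECTION = "object detection"
    SEMANTIC_SEGMENTATION = "semantic segmentation"
    INSTANCE_SEGMENTATION = "instance segmentation"


# Groups of properties the abstract says a fingerprint captures.
# The attribute names themselves are hypothetical.
GROUPS = ("domain_interest", "target_structure", "data_set", "algorithm_output")


@dataclass
class ProblemFingerprint:
    """Hypothetical container; the keys inside each dict are free-form."""
    category: ProblemCategory
    domain_interest: dict = field(default_factory=dict)
    target_structure: dict = field(default_factory=dict)
    data_set: dict = field(default_factory=dict)
    algorithm_output: dict = field(default_factory=dict)


def missing_groups(fp: ProblemFingerprint) -> list:
    """Return the property groups that have not been filled in yet."""
    return [name for name in GROUPS if not getattr(fp, name)]


if __name__ == "__main__":
    # Example keys and values are invented for illustration only.
    fp = ProblemFingerprint(
        category=ProblemCategory.SEMANTIC_SEGMENTATION,
        domain_interest={"note": "boundary accuracy matters"},
    )
    print(fp.category.value)
    print(missing_groups(fp))
```

Keeping the fingerprint as explicit data, separate from any metric computation, reflects the ordering described above: the problem is characterised first, and metrics are chosen afterwards on that basis.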
