Use of Equivalent Relative Utility (ERU) to Evaluate Artificial Intelligence-Enabled Rule-Out Devices

We investigated the use of equivalent relative utility (ERU) to evaluate the effectiveness of artificial intelligence (AI)-enabled rule-out devices, which use AI to identify non-cancer patient images in screening mammography and autonomously remove them from radiologist review. We reviewed two performance metrics that can be used to compare diagnostic performance between the radiologist-with-rule-out-device and radiologist-without-device workflows: positive/negative predictive values (PPV/NPV) and ERU. To demonstrate their use, we applied both metrics to a recent US-based study that retrospectively applied its AI algorithm to a large mammography dataset and reported improved performance of the radiologist-with-device workflow compared with the workflow without the device. We further applied the ERU method to a European study, using its reported recall rates and cancer detection rates at different thresholds of its AI algorithm to compare the potential utility among thresholds. For the study using US data, neither the PPV/NPV method nor the ERU method supported a conclusion of significant improvement in diagnostic performance at any of the reported algorithm thresholds. For the study using European data, ERU values at lower AI thresholds were higher than those at a higher threshold, because more false-negative cases would be ruled out at the higher threshold, reducing overall diagnostic performance. Both the PPV/NPV and ERU methods can be used to compare diagnostic performance between the radiologist-with-device workflow and the workflow without the device. One limitation of the ERU method is the need to measure the baseline, standard-of-care relative utility (RU) value for mammography screening in the US. Once that baseline value is known, the ERU method can be applied to large US datasets without knowing the true prevalence in the dataset.
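As an illustration of the PPV/NPV comparison described above, the following is a minimal sketch, not the authors' analysis. It assumes the standard definitions PPV = TP/(TP+FP) and NPV = TN/(TN+FN), and it assumes that every case removed by the rule-out device is counted as a negative call in the with-device workflow. The names `Counts`, `ppv`, `npv`, and `with_rule_out` are hypothetical, and all counts are made up for illustration; they are not taken from either study. The ERU calculation is not sketched because its formula is not given in this abstract.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    """Confusion-matrix counts for one screening workflow (hypothetical helper)."""
    tp: int  # cancers recalled
    fp: int  # non-cancers recalled
    tn: int  # non-cancers not recalled
    fn: int  # cancers not recalled


def ppv(c: Counts) -> float:
    """Positive predictive value: fraction of recalled cases that are cancer."""
    return c.tp / (c.tp + c.fp)


def npv(c: Counts) -> float:
    """Negative predictive value: fraction of non-recalled cases that are cancer-free."""
    return c.tn / (c.tn + c.fn)


def with_rule_out(reader_on_kept: Counts,
                  ruled_out_cancer: int,
                  ruled_out_non_cancer: int) -> Counts:
    """Combine radiologist results on the cases the device kept with the
    cases the device removed. Assumption: ruled-out cases are never read,
    so they count as negative calls (cancers among them become false negatives)."""
    return Counts(
        tp=reader_on_kept.tp,
        fp=reader_on_kept.fp,
        tn=reader_on_kept.tn + ruled_out_non_cancer,
        fn=reader_on_kept.fn + ruled_out_cancer,
    )


# Illustrative (made-up) counts for 10,000 screening exams.
without_device = Counts(tp=46, fp=950, tn=9000, fn=4)
device_workflow = with_rule_out(
    Counts(tp=45, fp=500, tn=4449, fn=4),  # radiologist on the 4,998 kept exams
    ruled_out_cancer=1,
    ruled_out_non_cancer=5001,
)

for name, c in [("without device", without_device), ("with device", device_workflow)]:
    print(f"{name}: PPV={ppv(c):.4f}, NPV={npv(c):.4f}")
```

In this sketch, any cancer removed by the device lowers NPV, while removing non-cancer cases that would otherwise have been recalled raises PPV. A real comparison would also need confidence intervals before claiming a significant difference between the two workflows.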
