Evaluating the role of correlation among markers in prediction models

Various methods have been used to estimate models that maximize the area under the receiver operating characteristic curve (ROC-AUC). Once a model has been developed, integrating novel biomarkers may improve its diagnostic ability. However, adding a new biomarker does not always yield an evident improvement in discrimination, even when the marker itself has good discriminatory power. The sign and magnitude of the correlations between biomarkers may affect model performance. In this paper, we assess the effect of such correlations on the discrimination ability of predictive models. Under multivariate normality, we derive an expression for the maximum AUC as a function of the correlations between markers and illustrate it graphically using surfaces. Simulations based on logarithmic folded bivariate normal and Gamma distributions address the case of skewed data. Additionally, we assess the AUC improvement obtained by combining 1934 blood lipid metabolites, determined by liquid chromatography, in 44 pancreatic cancer cases and 38 controls from the PanGenMic Study. Our results show that negative correlations consistently maximize the combined AUC, offering the greatest improvements when markers have equal predictive ability, while positive correlations yield the least favorable results. For markers with differing abilities, negative correlations remain optimal, although positive correlations also show slight benefits. Simulations with skewed distributions confirm these trends, emphasizing the role of asymmetry in marker selection. The real-data analysis of serum lipid-derived metabolites for detecting pancreatic ductal adenocarcinoma (PDAC) reinforces the influence of correlations on AUC optimization. These findings suggest that the sign and magnitude of inter-biomarker correlations should be considered when incorporating new markers into predictive algorithms.
