Quantitative structure-activity relationship (QSAR) regression models are commonly used to predict the biological activities of compounds from their molecular descriptors. Predictions from QSAR models can help, for example, to optimize molecular structure, prioritize compounds for further experimental testing, and estimate toxicity. In addition to an accurate estimate of the activity, it is highly desirable to estimate the uncertainty associated with each prediction, e.g., to calculate a prediction interval (PI) that contains the true molecular activity with a pre-specified probability, say 70%, 90%, or 95%. The challenge is that most machine learning (ML) algorithms that achieve superior predictive performance require add-on methods to estimate the uncertainty of their predictions. Developing such methods is an active area of research in the statistical and ML communities, but their implementation for QSAR modeling remains limited. Conformal prediction (CP) is a promising approach: it is agnostic to the prediction algorithm and can produce valid prediction intervals under weak assumptions on the data distribution. We propose computationally efficient CP algorithms tailored to the most advanced ML models, including Deep Neural Networks and Gradient Boosting Machines. The validity and efficiency of the proposed conformal predictors are demonstrated on a diverse collection of QSAR datasets as well as in simulation studies.
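To make the idea of an algorithm-agnostic prediction interval concrete, here is a minimal sketch of generic split (inductive) conformal prediction for regression. It is not the CP algorithms proposed in the paper. Everything in it is an assumption for illustration: the synthetic data stands in for molecular descriptors and activities, the least-squares model stands in for a Deep Neural Network or Gradient Boosting Machine, and the helper name `split_conformal_interval` is hypothetical.

```python
import numpy as np


def split_conformal_interval(predict, X_cal, y_cal, X_new, alpha=0.1):
    """Generic split conformal PI with nominal coverage 1 - alpha.

    `predict` is any fitted point predictor (hypothetical stand-in for a
    QSAR model); the calibration set must not have been used to fit it.
    """
    # Nonconformity scores: absolute residuals on the calibration set.
    scores = np.sort(np.abs(y_cal - predict(X_cal)))
    n = len(scores)
    # Finite-sample-corrected rank of the calibration quantile.
    k = int(np.ceil((n + 1) * (1 - alpha)))
    # If the calibration set is too small, the interval is unbounded.
    q = scores[k - 1] if k <= n else np.inf
    preds = predict(X_new)
    return preds - q, preds + q


# Synthetic data (an assumption; real use would take molecular descriptors).
rng = np.random.default_rng(0)
n_train, n_cal, n_test, d = 500, 500, 2000, 5
w_true = rng.normal(size=d)


def make_data(n):
    X = rng.normal(size=(n, d))
    y = X @ w_true + rng.normal(scale=0.5, size=n)
    return X, y


X_train, y_train = make_data(n_train)
X_cal, y_cal = make_data(n_cal)
X_test, y_test = make_data(n_test)

# Placeholder predictor: ordinary least squares fitted on the training split.
coef, *_ = np.linalg.lstsq(X_train, y_train, rcond=None)


def predict(X):
    return X @ coef


lower, upper = split_conformal_interval(predict, X_cal, y_cal, X_test, alpha=0.1)

# Empirical coverage on held-out data; expected to be close to 1 - alpha.
coverage = np.mean((y_test >= lower) & (y_test <= upper))
print(f"empirical coverage: {coverage:.3f}, interval width: {(upper - lower)[0]:.3f}")
```

Because the interval is built only from calibration residuals, the same function works with any point predictor. This simple variant yields intervals of constant width; in this setting, validity corresponds to the empirical coverage matching the nominal level, and efficiency to how narrow the intervals are.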