We introduce molFTP (molecular fragment-target prevalence), a compact molecular representation that delivers strong predictive performance. To prevent feature leakage across cross-validation folds, we implement a dummy-masking procedure that removes information about fragments present in the held-out molecules. We further show that key leave-one-out (key-LOO) closely approximates true molecule-level leave-one-out (LOO), with a deviation below 8% on our datasets. This enables training on nearly the full dataset while preserving unbiased cross-validation estimates of model performance. Overall, molFTP provides a fast, leakage-resistant fragment-target prevalence vectorization with practical safeguards (dummy masking or key-LOO) that approximate LOO at a fraction of its cost.
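The leakage concern can be illustrated with a minimal sketch. It is not the molFTP implementation: the fragment sets, the helper names `fragment_prevalence` and `featurize`, and the neutral default of 0.5 for unseen fragments are all assumptions made for illustration. The sketch only shows why prevalence statistics should be computed without the held-out molecule.

```python
from collections import defaultdict


def fragment_prevalence(fragment_sets, labels):
    """Fraction of active molecules among those containing each fragment."""
    counts = defaultdict(lambda: [0, 0])  # fragment -> [n_active, n_total]
    for frags, y in zip(fragment_sets, labels):
        for f in frags:
            counts[f][0] += y
            counts[f][1] += 1
    return {f: active / total for f, (active, total) in counts.items()}


def featurize(frags, prevalence, default=0.5):
    """Mean prevalence over a molecule's fragments.

    Fragments absent from the prevalence table get a neutral default
    (an assumption of this sketch).
    """
    if not frags:
        return default
    return sum(prevalence.get(f, default) for f in frags) / len(frags)


# Toy data: each molecule is a set of hypothetical fragment keys.
molecules = [{"a", "b"}, {"a", "c"}, {"b", "c"}, {"d"}]
labels = [1, 1, 0, 0]
held_out = 3

# Leaky: prevalence is computed on all molecules, including the held-out one.
leaky_prev = fragment_prevalence(molecules, labels)
x_leaky = featurize(molecules[held_out], leaky_prev)

# Leakage-free: the held-out molecule is excluded before counting.
train_idx = [i for i in range(len(molecules)) if i != held_out]
clean_prev = fragment_prevalence(
    [molecules[i] for i in train_idx],
    [labels[i] for i in train_idx],
)
x_clean = featurize(molecules[held_out], clean_prev)
```

In this toy case, fragment `d` occurs only in the held-out molecule, so the leaky feature equals that molecule's own label, while the leakage-free feature falls back to the neutral default.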