A widely recognized limitation of molecular prediction models is their reliance on structures observed in the training data, which results in poor generalization to out-of-distribution compounds. Yet in drug discovery, the compounds most critical for advancing research often lie beyond the training set, making this bias toward the training data particularly problematic. The mismatch introduces substantial covariate shift, under which standard deep learning models produce unstable and inaccurate predictions. The scarcity of labeled data, which stems from the onerous and costly nature of experimental validation, further exacerbates the difficulty of achieving reliable generalization. To address these limitations, we propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data, enabling the model to learn how to generalize beyond the training distribution. We demonstrate significant performance gains on challenging real-world datasets with substantial covariate shift, supported by t-SNE visualizations that illustrate our interpolation method.