Robust Molecular Property Prediction via Densifying Scarce Labeled Data

Molecules · Generalization · Sparsity · Robustness · Molecular properties

Jina Kim, Jeffrey Willette, Bruno Andreis, Sung Ju Hwang

A widely recognized limitation of molecular prediction models is their reliance on structures observed in the training data, resulting in poor generalization to out-of-distribution compounds. Yet in drug discovery, the compounds most critical for advancing research often lie beyond the training set, making the bias toward the training data particularly problematic. This mismatch introduces substantial covariate shift, under which standard deep learning models produce unstable and inaccurate predictions. Furthermore, the scarcity of labeled data, stemming from the onerous and costly nature of experimental validation, exacerbates the difficulty of achieving reliable generalization. To address these limitations, we propose a novel bilevel optimization approach that leverages unlabeled data to interpolate between in-distribution (ID) and out-of-distribution (OOD) data, enabling the model to learn how to generalize beyond the training distribution. We demonstrate significant performance gains on challenging real-world datasets with substantial covariate shift, supported by t-SNE visualizations highlighting our interpolation method.
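The abstract does not spell out the algorithm, so the following is only a toy sketch of what a bilevel loop with input interpolation can look like in general; it is not the authors' method. Everything in it is an assumption made for illustration: the 1-D ridge regression used as the inner problem, the mixup-style convex combination of labeled and unlabeled inputs, the choice to keep the original labels for the interpolated inputs, the finite-difference outer gradient, and all function names (`interpolate`, `inner_fit`, `outer_loss`, `tune_lam`, `make_toy_data`).

```python
import random


def interpolate(x_labeled, x_unlabeled, lam):
    """Convex combination of paired labeled and unlabeled inputs (assumed form)."""
    return [(1.0 - lam) * a + lam * b for a, b in zip(x_labeled, x_unlabeled)]


def inner_fit(xs, ys, ridge=1e-2):
    """Inner problem: closed-form slope of 1-D ridge regression through the origin."""
    return sum(x * y for x, y in zip(xs, ys)) / (sum(x * x for x in xs) + ridge)


def outer_loss(lam, data):
    """Outer objective: held-out MSE of the model fitted on inputs mixed with lam."""
    x_l, y_l, x_u, x_val, y_val = data
    w = inner_fit(interpolate(x_l, x_u, lam), y_l)  # labels kept as-is (assumption)
    return sum((w * x - y) ** 2 for x, y in zip(x_val, y_val)) / len(x_val)


def tune_lam(data, lam=0.5, lr=0.05, steps=50, eps=1e-4):
    """Outer loop: finite-difference descent on lam, kept inside [0, 1].

    A step is accepted only if it lowers the held-out loss; otherwise the
    step size is halved, so the returned loss never exceeds the starting one.
    """
    best = outer_loss(lam, data)
    for _ in range(steps):
        lam_hi, lam_lo = min(lam + eps, 1.0), max(lam - eps, 0.0)
        grad = (outer_loss(lam_hi, data) - outer_loss(lam_lo, data)) / (lam_hi - lam_lo)
        cand = min(max(lam - lr * grad, 0.0), 1.0)
        cand_loss = outer_loss(cand, data)
        if cand_loss < best:
            lam, best = cand, cand_loss
        else:
            lr *= 0.5
    return lam, best


def make_toy_data(seed=0, n=20, slope=2.0, shift=1.0):
    """Synthetic data: labeled inputs near 0, unlabeled and held-out inputs shifted."""
    rng = random.Random(seed)
    x_l = [rng.gauss(0.0, 1.0) for _ in range(n)]
    y_l = [slope * x + rng.gauss(0.0, 0.1) for x in x_l]
    x_u = [rng.gauss(shift, 1.0) for _ in range(n)]
    x_val = [rng.gauss(shift, 1.0) for _ in range(n)]
    y_val = [slope * x + rng.gauss(0.0, 0.1) for x in x_val]
    return x_l, y_l, x_u, x_val, y_val


data = make_toy_data()
start = outer_loss(0.5, data)
lam, final = tune_lam(data)
print(f"lam={lam:.3f}  held-out MSE: {start:.4f} -> {final:.4f}")
```

The sketch only shows the mechanics of nesting one optimization inside another: the inner step fits model weights for a given interpolation coefficient, and the outer step adjusts that coefficient against held-out data. It makes no claim that interpolation helps on this synthetic data, and a real molecular model would replace the scalar regression with a learned encoder and a gradient-based outer update.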
