InstructBio: A Large-scale Semi-supervised Learning Paradigm for Biochemical Problems

In artificial intelligence for science, the limited amount of labeled data available for real-world problems is a persistent challenge. The prevailing approach is to pretrain a powerful task-agnostic model on a large unlabeled corpus, but such a model may struggle to transfer its knowledge to downstream tasks. In this study, we propose InstructBio, a semi-supervised learning algorithm that takes better advantage of unlabeled examples. It introduces an instructor model that provides confidence scores as a measure of the reliability of pseudo-labels. These confidence scores then guide the target model to weight different data points differently, avoiding both over-reliance on labeled data and the negative influence of incorrect pseudo-annotations. Comprehensive experiments show that InstructBio substantially improves the generalization ability of molecular models, not only in molecular property prediction but also in activity cliff estimation, demonstrating the superiority of the proposed method. Furthermore, our evidence indicates that InstructBio can be combined with cutting-edge pretraining methods and used to build large-scale, task-specific pseudo-labeled molecular datasets, which reduces predictive error and shortens training. Our work provides strong evidence that semi-supervised learning can be a promising tool for overcoming data scarcity and advancing molecular representation learning.
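The abstract does not give the exact loss, so the following is only a minimal sketch of the general idea of weighting pseudo-labeled samples by an instructor's confidence. The function name `confidence_weighted_mse`, the choice of squared error, and the rule "weight 1.0 for true labels, instructor confidence for pseudo-labels" are all assumptions made for illustration, not the authors' implementation.

```python
def confidence_weighted_mse(preds, targets, is_labeled, confidence):
    """Weighted mean squared error over labeled and pseudo-labeled samples.

    Hypothetical weighting rule (an assumption, not taken from the paper):
    samples with true labels get weight 1.0, pseudo-labeled samples get the
    instructor's confidence score, clipped to [0, 1].
    """
    if not (len(preds) == len(targets) == len(is_labeled) == len(confidence)):
        raise ValueError("all inputs must have the same length")

    weighted_sum = 0.0
    total_weight = 0.0
    for pred, target, labeled, conf in zip(preds, targets, is_labeled, confidence):
        # Trust true labels fully; trust pseudo-labels only as much as the
        # instructor does.
        weight = 1.0 if labeled else min(max(conf, 0.0), 1.0)
        weighted_sum += weight * (pred - target) ** 2
        total_weight += weight

    # Normalize by the total weight so the loss scale does not depend on
    # how many pseudo-labels were down-weighted.
    return weighted_sum / total_weight if total_weight > 0 else 0.0


# Two labeled samples and one pseudo-labeled sample whose pseudo-label
# disagrees strongly with the prediction.
preds = [1.0, 2.0, 3.0]
targets = [1.0, 2.5, 0.0]
is_labeled = [True, True, False]

# With zero confidence the pseudo-label does not contribute to the loss.
loss_ignored = confidence_weighted_mse(preds, targets, is_labeled, [1.0, 1.0, 0.0])
# With full confidence the pseudo-label counts like a true label.
loss_trusted = confidence_weighted_mse(preds, targets, is_labeled, [1.0, 1.0, 1.0])
```

Under this assumed rule, a pseudo-label the instructor distrusts barely moves the target model, while a trusted one is treated almost like a real annotation.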


