In artificial intelligence for science, the limited amount of labeled data available for real-world problems is a persistent challenge. The prevailing approach is to pretrain a powerful task-agnostic model on a large unlabeled corpus, but such a model may struggle to transfer knowledge to downstream tasks. In this study, we propose InstructMol, a semi-supervised learning algorithm that takes better advantage of unlabeled examples. It introduces an instructor model that provides confidence ratios as a measure of the reliability of pseudo-labels. These confidence ratios then guide the target model to pay different amounts of attention to different data points, avoiding both over-reliance on labeled data and the negative influence of incorrect pseudo-annotations. Comprehensive experiments show that InstructBio substantially improves the generalization ability of molecular models, not only in molecular property prediction but also in activity cliff estimation, demonstrating the superiority of the proposed method. Furthermore, our evidence indicates that InstructBio can be combined with cutting-edge pretraining methods and used to build large-scale, task-specific pseudo-labeled molecular datasets, which reduces predictive error and shortens training. Our work provides strong evidence that semi-supervised learning can be a promising tool for overcoming data scarcity and advancing molecular representation learning.
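The abstract describes confidence ratios that tell the target model how much attention each data point deserves, but it does not give the loss function. The sketch below illustrates one plausible reading, assuming a regression task, confidences in [0, 1], and a weight of 1 for every labeled example. The function name `confidence_weighted_mse` and all parameter names are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def confidence_weighted_mse(
    y_labeled: Sequence[float],
    pred_labeled: Sequence[float],
    y_pseudo: Sequence[float],
    pred_pseudo: Sequence[float],
    confidence: Sequence[float],
) -> float:
    """Weighted mean squared error over labeled and pseudo-labeled examples.

    Assumption (not from the paper): labeled examples get weight 1 and each
    pseudo-labeled example gets the instructor's confidence as its weight.
    """
    if len(y_labeled) != len(pred_labeled):
        raise ValueError("labeled targets and predictions differ in length")
    if not (len(y_pseudo) == len(pred_pseudo) == len(confidence)):
        raise ValueError("pseudo-label inputs differ in length")
    if any(c < 0.0 or c > 1.0 for c in confidence):
        raise ValueError("confidence values must lie in [0, 1]")

    # Squared errors on ground-truth labels, each with weight 1.
    labeled_terms = [(p - y) ** 2 for y, p in zip(y_labeled, pred_labeled)]

    # Squared errors on pseudo-labels, scaled by the instructor's confidence,
    # so an unreliable pseudo-label contributes little to the loss.
    pseudo_terms = [
        c * (p - y) ** 2 for y, p, c in zip(y_pseudo, pred_pseudo, confidence)
    ]

    # Normalize by the total weight to obtain a weighted mean.
    total_weight = len(labeled_terms) + sum(confidence)
    if total_weight == 0:
        return 0.0
    return (sum(labeled_terms) + sum(pseudo_terms)) / total_weight


if __name__ == "__main__":
    loss = confidence_weighted_mse(
        y_labeled=[1.0],
        pred_labeled=[2.0],
        y_pseudo=[0.0, 0.0],
        pred_pseudo=[2.0, 2.0],
        confidence=[1.0, 0.0],  # the second pseudo-label is fully distrusted
    )
    print(loss)
```

In this toy call the second pseudo-label has confidence 0, so it affects neither the numerator nor the normalizer. This is the sense in which low-confidence pseudo-annotations are kept from harming the target model under the stated assumptions.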