Machine learning is catalyzing a revolution in chemical and biological science. However, its efficacy depends heavily on the availability of labeled data, and annotating biochemical data is extremely laborious. To overcome this data sparsity challenge, we present InstructMol, an instructive learning algorithm that measures the reliability of pseudo-labels and helps the target model leverage large-scale unlabeled data. InstructMol does not require transferring knowledge between multiple domains, which avoids the potential gap between the pretraining and fine-tuning stages. We demonstrate the high accuracy of InstructMol on several real-world molecular datasets and out-of-distribution (OOD) benchmarks. Code is available at https://github.com/smiles724/InstructMol.