Recently, pre-trained foundation models have enabled significant advancements in multiple fields. In molecular machine learning, however, datasets are often hand-curated and hence typically small, and the lack of datasets with labeled features, and of codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets grouped by size into three categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundation models based on our proposed datasets, we present the Graphium graph machine learning library, which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point for multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets improves when models are also trained on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it on resource-constrained downstream tasks.