Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets

Dominique Beaini, Shenyang Huang, Joao Alex Cunha, Gabriela Moisescu-Pareja, Oleksandr Dymov, Samuel Maddrell-Mander, Callum McLean, Frederik Wenkel, Luis Müller, Jama Hussein Mohamud, Ali Parviz, Michael Craig, Michał Koziarski, Jiarui Lu, Zhaocheng Zhu, Cristian Gabellini, Kerstin Klaser, Josef Dean, Cas Wognum, Maciej Sypetkowski, Guillaume Rabusseau, Reihaneh Rabbany, Jian Tang, Christopher Morris, Mirco Ravanelli, Guy Wolf, Prudencio Tossou, Hadrien Mary, Therence Bois, Andrew Fitzgibbon, Błażej Banaszewski, Chad Martin, Dominic Masters

Recently, pre-trained foundation models have enabled significant advances in multiple fields. In molecular machine learning, however, datasets are often hand-curated and hence typically small, and the lack of datasets with labeled features, and of codebases to manage those datasets, has hindered the development of foundation models. In this work, we present seven novel datasets grouped by size into three categories: ToyMix, LargeMix and UltraLarge. These datasets push the boundaries in both the scale and the diversity of supervised labels for molecular learning. They cover nearly 100 million molecules and over 3000 sparsely defined tasks, totaling more than 13 billion individual labels of both quantum and biological nature. In comparison, our datasets contain 300 times more data points than the widely used OGB-LSC PCQM4Mv2 dataset, and 13 times more than the quantum-only QM1B dataset. In addition, to support the development of foundation models based on our proposed datasets, we present the Graphium graph machine learning library, which simplifies the process of building and training molecular machine learning models for multi-task and multi-level molecular datasets. Finally, we present a range of baseline results as a starting point for multi-task and multi-level training on these datasets. Empirically, we observe that performance on low-resource biological datasets improves when models are also trained on large amounts of quantum data. This indicates that there may be potential in multi-task and multi-level training of a foundation model and fine-tuning it to resource-constrained downstream tasks.
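
To make "sparsely defined tasks" concrete: most molecules carry labels for only a few of the tasks, so a multi-task loss has to skip the missing entries. The following is a minimal sketch of one common way to do that, not the Graphium API and not necessarily the loss used for the baselines. The function name `masked_multitask_mse`, the use of NaN to mark a missing label, and the choice to average per task before averaging across tasks are all assumptions made for illustration.

```python
import math


def masked_multitask_mse(preds, labels):
    """Mean squared error over observed labels only (illustrative sketch).

    preds, labels: lists of rows, one row per molecule, one column per task.
    A missing label is encoded as float("nan") (an assumption of this sketch).
    """
    n_tasks = len(labels[0])
    sums = [0.0] * n_tasks   # summed squared error per task
    counts = [0] * n_tasks   # number of observed labels per task

    for pred_row, label_row in zip(preds, labels):
        for task, (pred, label) in enumerate(zip(pred_row, label_row)):
            if math.isnan(label):
                continue  # no label for this molecule/task pair: skip it
            sums[task] += (pred - label) ** 2
            counts[task] += 1

    # Average within each task first, so a task with many labels does not
    # drown out a task with few; tasks with no labels at all are ignored.
    per_task = [s / c for s, c in zip(sums, counts) if c > 0]
    if not per_task:
        raise ValueError("no observed labels")
    return sum(per_task) / len(per_task)


nan = float("nan")
example_loss = masked_multitask_mse(
    preds=[[1.0, 2.0], [3.0, 4.0]],
    labels=[[1.0, nan], [1.0, 4.0]],
)
```

Averaging within each task before combining is one design choice among several; weighting tasks by label count or by a fixed per-task coefficient would be equally valid under different goals.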

