The emergence of foundation models in Computer Vision and Natural Language Processing has resulted in immense progress on downstream tasks. This progress was enabled by datasets with billions of training examples. Similar benefits are yet to be unlocked for quantum chemistry, where the potential of deep learning is constrained by comparatively small datasets with 100k to 20M training examples. These datasets are limited in size because the labels are computed using the accurate (but computationally demanding) predictions of Density Functional Theory (DFT). Notably, prior DFT datasets were created using CPU supercomputers without leveraging hardware acceleration. In this paper, we take a first step towards utilising hardware accelerators by introducing the data generator PySCF$_{\text{IPU}}$, which runs on Intelligence Processing Units (IPUs). This allowed us to create the dataset QM1B with one billion training examples of molecules containing 9-11 heavy atoms. We demonstrate that a simple baseline neural network (SchNet 9M) improves its performance simply by increasing the amount of training data, without additional inductive biases. To encourage future researchers to use QM1B responsibly, we highlight several limitations of QM1B and emphasise the low resolution of our DFT options, which also serves as motivation for even larger, more accurate datasets. Code and dataset are available on GitHub: http://github.com/graphcore-research/pyscf-ipu