We investigate trends in the data-error scaling behavior of machine learning (ML) models trained on discrete combinatorial spaces that are prone to mutation, such as proteins or organic small molecules. We trained and evaluated kernel ridge regression machines using variable amounts of computationally generated training data. Our synthetic datasets comprise i) two naïve functions based on many-body theory; ii) binding energy estimates between a protein and a mutagenised peptide; and iii) solvation energies of two structural graphs with six heavy atoms each. In contrast to typical data-error scaling, our results showed discontinuous monotonic phase transitions during learning, observed as rapid drops in the test error at particular thresholds of training data. We observed two learning regimes, which we call saturated and asymptotic decay, and found that they are conditioned by the level of complexity (i.e. the number of mutations) enclosed in the training set. We show that, for this class of problems, the predictions of the ML models employed were clustered in the calibration plots. Furthermore, we present an alternative strategy to normalize learning curves (LCs) and introduce the concept of mutant-based shuffling. This work has implications for machine learning on mutagenisable discrete spaces, such as the prediction of chemical properties or protein phenotypes, and improves the basic understanding of concepts in statistical learning theory.
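As a rough illustration of what a data-error learning curve on a discrete, mutable space looks like in practice, the following minimal sketch trains a Gaussian-kernel ridge regressor on increasing amounts of synthetic data and records the test error. Everything in it is an assumption made for illustration: the additive toy target, the one-hot encoding, the hyperparameters `sigma` and `lam`, and all function names are hypothetical and are not the datasets, representations, or settings used in this work. It is not expected to reproduce the phase transitions described above.

```python
import itertools
import numpy as np


def one_hot(seqs, n_letters):
    """Encode integer sequences of shape (n, L) as flat one-hot vectors."""
    seqs = np.asarray(seqs)
    return np.eye(n_letters)[seqs].reshape(len(seqs), -1)


def gaussian_kernel(a, b, sigma):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    sq = (a ** 2).sum(1)[:, None] + (b ** 2).sum(1)[None, :] - 2.0 * a @ b.T
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def krr_fit_predict(x_train, y_train, x_test, sigma=2.0, lam=1e-6):
    """Kernel ridge regression: solve (K + lam * I) alpha = y, then predict."""
    k = gaussian_kernel(x_train, x_train, sigma)
    alpha = np.linalg.solve(k + lam * np.eye(len(x_train)), y_train)
    return gaussian_kernel(x_test, x_train, sigma) @ alpha


def learning_curve(train_sizes, n_sites=6, n_letters=4, n_test=500, seed=0):
    """Return the test MAE for each training-set size (nested training sets)."""
    rng = np.random.default_rng(seed)
    # Enumerate the full discrete space: every sequence of n_sites letters.
    all_seqs = np.array(list(itertools.product(range(n_letters), repeat=n_sites)))
    # Hypothetical toy target: a sum of independent per-site contributions.
    site_weights = rng.normal(size=(n_sites, n_letters))
    y = site_weights[np.arange(n_sites), all_seqs].sum(axis=1)
    x = one_hot(all_seqs, n_letters)

    # Hold out a fixed test set; draw training sets from the remaining pool.
    perm = rng.permutation(len(x))
    test_idx, pool_idx = perm[:n_test], perm[n_test:]

    maes = []
    for n in train_sizes:
        train_idx = pool_idx[:n]
        pred = krr_fit_predict(x[train_idx], y[train_idx], x[test_idx])
        maes.append(float(np.mean(np.abs(pred - y[test_idx]))))
    return maes


if __name__ == "__main__":
    sizes = [16, 64, 256, 512]
    for n, mae in zip(sizes, learning_curve(sizes)):
        print(f"N = {n:4d}  test MAE = {mae:.4f}")
```

Plotting the returned errors against the training-set sizes on log-log axes gives the usual learning-curve view; a smooth toy target such as this one would be expected to show gradual decay rather than the abrupt drops reported here.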