An activity cliff (AC) is a phenomenon in which a pair of similar molecules differ by a small structural alteration yet exhibit a large difference in biochemical activity. ACs in small molecules have been extensively investigated, but little is known about the AC phenomenon in peptides composed of canonical amino acids. This study introduces AMPCliff, a quantitative definition and benchmarking framework for the AC phenomenon in antimicrobial peptides (AMPs) composed of canonical amino acids. A comprehensive analysis of the existing AMP dataset reveals a significant prevalence of ACs within AMPs. AMPCliff quantifies AMP activity by the minimum inhibitory concentration (MIC) and defines an activity cliff as a pair of aligned peptides whose normalized BLOSUM62 similarity score is at least 0.9 and whose MIC values differ by at least two-fold. This study establishes a benchmark dataset of paired AMPs against Staphylococcus aureus from the publicly available AMP dataset GRAMPA, and applies a rigorous procedure to evaluate a range of AMP AC prediction models: nine machine learning algorithms, four deep learning algorithms, four masked language models, and four generative language models. Our analysis shows that these models are capable of detecting AMP AC events, and that the pre-trained protein language model ESM2 demonstrates superior performance across the evaluations. The prediction of AMP activity cliffs still leaves considerable room for improvement: ESM2 with 33 layers achieves a Spearman correlation coefficient of only 0.50 on the MIC regression task of the benchmark dataset. Source code and additional resources are available at https://www.healthinformaticslab.org/supp/ or https://github.com/Kewei2023/AMPCliff-generation.
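The cliff criterion described above (normalized BLOSUM62 similarity of at least 0.9 together with an MIC change of at least two-fold) can be illustrated with a minimal sketch. Several parts of it are assumptions rather than the AMPCliff implementation: the peptides are taken to be already aligned, equal in length, and gap-free; the normalization divides the pair score by the geometric mean of the two self-scores, which is one common choice and may differ from the one AMPCliff uses; the substitution table is only a five-residue excerpt of BLOSUM62; and the function names and MIC values are hypothetical.

```python
import math

# Five-residue excerpt of BLOSUM62 (A, R, I, L, K), kept small for illustration.
# A real analysis would load the full matrix from a maintained library.
BLOSUM62_EXCERPT = {
    ("A", "A"): 4, ("R", "R"): 5, ("I", "I"): 4, ("L", "L"): 4, ("K", "K"): 5,
    ("A", "R"): -1, ("A", "I"): -1, ("A", "L"): -1, ("A", "K"): -1,
    ("R", "I"): -3, ("R", "L"): -2, ("R", "K"): 2,
    ("I", "L"): 2, ("I", "K"): -3, ("L", "K"): -2,
}


def residue_score(a: str, b: str) -> int:
    """Look up a symmetric substitution score; KeyError outside the excerpt."""
    if (a, b) in BLOSUM62_EXCERPT:
        return BLOSUM62_EXCERPT[(a, b)]
    return BLOSUM62_EXCERPT[(b, a)]


def raw_score(seq_a: str, seq_b: str) -> int:
    """Sum position-wise scores of two pre-aligned, equal-length peptides."""
    if len(seq_a) != len(seq_b):
        raise ValueError("this sketch assumes equal-length, gap-free alignments")
    return sum(residue_score(a, b) for a, b in zip(seq_a, seq_b))


def normalized_similarity(seq_a: str, seq_b: str) -> float:
    """Assumed normalization: pair score over geometric mean of self-scores."""
    denom = math.sqrt(raw_score(seq_a, seq_a) * raw_score(seq_b, seq_b))
    return raw_score(seq_a, seq_b) / denom


def is_activity_cliff(seq_a: str, mic_a: float, seq_b: str, mic_b: float,
                      sim_threshold: float = 0.9,
                      fold_threshold: float = 2.0) -> bool:
    """Apply both thresholds: high similarity and a large enough MIC fold change."""
    fold_change = max(mic_a, mic_b) / min(mic_a, mic_b)
    similar = normalized_similarity(seq_a, seq_b) >= sim_threshold
    return similar and fold_change >= fold_threshold


if __name__ == "__main__":
    parent = "KLAKLAKKLAKLAK"
    variant = "KLARLAKKLAKLAK"  # single K -> R substitution
    # MIC values below are made up purely to exercise the function.
    print(normalized_similarity(parent, variant))
    print(is_activity_cliff(parent, 4.0, variant, 32.0))
```

With these illustrative inputs, the conservative K to R substitution keeps the pair above the similarity threshold, so the pair counts as a cliff only when the MIC values differ by two-fold or more. A less conservative substitution such as K to I drops the similarity below 0.9 under this normalization, and the pair is then excluded regardless of the MIC change.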