The use of deep learning models in computational biology has increased massively in recent years, and this growth is expected to continue with current advances in fields such as Natural Language Processing. These models, although able to capture complex relations between input and target, are also prone to learning noisy deviations from the pool of data used during their development. To assess their performance on unseen data (their capacity to generalize), it is common to randomly split the available data into development (train/validation) and test sets. This procedure, although standard, has been shown to produce unreliable assessments of generalization because of the similarity between samples in the databases used. In this work, we present SpanSeq, a database partition method for machine learning that can scale to most biological sequences (genes, proteins and genomes) in order to avoid data leakage between sets. We also explore the effect of not restricting similarity between sets by reproducing the development of two state-of-the-art models in bioinformatics. The results not only confirm the consequences of randomly splitting databases for model assessment, but also extend those repercussions to model development. SpanSeq is available at https://github.com/genomicepidemiology/SpanSeq.
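To make the contrast with random splitting concrete, the following is a minimal sketch of a similarity-aware partition. It is not SpanSeq's algorithm or interface: the function names (`kmers`, `jaccard`, `similarity_aware_split`), the k-mer Jaccard similarity, the single-linkage grouping, and the default `k` and `threshold` values are all assumptions chosen for illustration. The idea it shows is only that sequences more similar than a threshold are grouped first, and whole groups are then assigned to partitions, so no pair above the threshold ends up in different sets.

```python
from itertools import combinations


def kmers(seq, k=3):
    """Return the set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def jaccard(a, b):
    """Jaccard similarity between two k-mer sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_aware_split(seqs, n_parts=2, k=3, threshold=0.5):
    """Return n_parts lists of sequence indices.

    Illustrative assumption: two sequences are "similar" when the Jaccard
    similarity of their k-mer sets is at least `threshold`.
    """
    profiles = [kmers(s, k) for s in seqs]
    parent = list(range(len(seqs)))  # union-find forest over sequence indices

    def find(i):
        # Find the cluster representative, compressing the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Single-linkage clustering: link every pair at or above the threshold.
    for i, j in combinations(range(len(seqs)), 2):
        if jaccard(profiles[i], profiles[j]) >= threshold:
            parent[find(i)] = find(j)

    # Collect the members of each connected component (cluster).
    clusters = {}
    for i in range(len(seqs)):
        clusters.setdefault(find(i), []).append(i)

    # Greedily place whole clusters, largest first, into the smallest partition.
    parts = [[] for _ in range(n_parts)]
    for members in sorted(clusters.values(), key=len, reverse=True):
        min(parts, key=len).extend(members)
    return parts


if __name__ == "__main__":
    toy = ["ACGTACGTAC", "ACGTACGTAA", "TTGGCCAATT", "TTGGCCAATG", "GAGAGAGAGA"]
    print(similarity_aware_split(toy))
```

The all-against-all comparison in this sketch is quadratic in the number of sequences, so it is only suitable for toy inputs. A random split of the same toy data could place the two near-identical `ACGT...` sequences in different sets, which is the kind of leakage the abstract describes.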