Recent advances in machine learning have created powerful new tools for tackling life science problems. This paper discusses the potential advantages of gene recommendation performed by artificial intelligence (AI). Gene recommendation engines address the following question: given a set of genes the user is interested in, which other genes are likely to be related to that set and worth investigating? This task is addressed with a custom deep learning recommendation engine, DeepProphet2 (DP2), which is freely available to researchers worldwide via https://www.generecommender.com?utm_source=DeepProphet2_paper&utm_medium=pdf. The paper illustrates the insights behind the algorithm and its practical applications. The gene recommendation problem can be addressed by mapping genes to a metric space in which a distance can be defined to represent the real semantic distance between them. To this end, a transformer-based model was trained on PubMed, a well-curated, freely available corpus of papers. The paper describes the optimization procedures used to obtain the best bias-variance trade-off, focusing on embedding size and network depth. The model's ability to discover sets of genes implicated in diseases and pathways was then assessed through cross-validation. A simple assumption guided the procedure: the network had no direct knowledge of pathways and diseases, but learned the similarities and interactions among genes. To further investigate the space in which the neural network represents genes, the dimensionality of the embedding was reduced and the results were projected onto a human-comprehensible space. Finally, a set of use cases illustrates the algorithm's potential applications in a real-world setting.
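The idea of recommending genes by distance in a metric space can be sketched in a few lines. The following is a minimal illustration, not DP2's actual method: the gene names, the 3-dimensional vectors, the `recommend` function, and the choice to rank candidates by Euclidean distance to the centroid of the query set are all assumptions made for this example. In practice the embeddings would come from the trained transformer model.

```python
import math

# Hypothetical toy embeddings (assumption): real vectors would be produced
# by a trained model and have many more dimensions.
TOY_EMBEDDINGS = {
    "GENE_A": [1.0, 0.0, 0.0],
    "GENE_B": [0.9, 0.1, 0.0],
    "GENE_C": [0.0, 1.0, 0.0],
    "GENE_D": [0.8, 0.0, 0.2],
    "GENE_E": [0.0, 0.0, 1.0],
}


def recommend(query_genes, embeddings, top_k=2):
    """Rank genes outside the query set by Euclidean distance to the
    centroid of the query set (an illustrative ranking rule)."""
    query = set(query_genes)
    dim = len(next(iter(embeddings.values())))

    # Centroid of the query genes in the embedding space.
    centroid = [
        sum(embeddings[g][i] for g in query) / len(query)
        for i in range(dim)
    ]

    # Candidates are all genes that are not already in the query set.
    candidates = [g for g in embeddings if g not in query]

    # Smaller distance means "semantically closer" under this assumption.
    ranked = sorted(candidates, key=lambda g: math.dist(embeddings[g], centroid))
    return ranked[:top_k]


if __name__ == "__main__":
    print(recommend(["GENE_A", "GENE_B"], TOY_EMBEDDINGS, top_k=2))
```

Euclidean distance is used here because it satisfies the metric axioms; other distances could be substituted without changing the structure of the sketch.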