Protein retrieval, which aims to untangle the relationships among protein sequences, structures, and functions, helps drive progress in biology. The Basic Local Alignment Search Tool (BLAST), a sequence-similarity-based algorithm, has demonstrated the value of this field. Existing protein retrieval tools, however, prioritize sequence similarity and may overlook proteins that are dissimilar in sequence but share homology or function. To tackle this problem, we propose a novel protein retrieval framework that mitigates the bias towards sequence similarity. Our framework first harnesses protein language models (PLMs) to embed protein sequences in a high-dimensional feature space, which strengthens the representations available for subsequent analysis. An accelerated, indexed vector database is then constructed to enable fast access to and retrieval of the resulting dense vectors. Extensive experiments demonstrate that our framework retrieves similar and dissimilar proteins equally well. Moreover, the approach identifies proteins that conventional methods fail to uncover. This framework can assist protein mining and support further advances in biology.
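The embed-then-index pipeline described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The `embed` function is a hypothetical stand-in that uses normalised dipeptide counts where the framework would use a PLM, and `VectorIndex` performs an exhaustive cosine-similarity scan where the framework would use an accelerated index. The protein identifiers and sequences are invented for illustration.

```python
import math
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# One dimension per dipeptide (20 x 20 = 400 dimensions).
KMER_INDEX = {a + b: i for i, (a, b) in enumerate(product(AMINO_ACIDS, repeat=2))}


def embed(sequence):
    """Toy embedding: L2-normalised dipeptide counts.

    ASSUMPTION: this is a placeholder, NOT a protein language model. A real
    pipeline would return the PLM's pooled hidden state here instead.
    """
    counts = Counter(sequence[i:i + 2] for i in range(len(sequence) - 1))
    vec = [0.0] * len(KMER_INDEX)
    for kmer, n in counts.items():
        if kmer in KMER_INDEX:  # skip dipeptides with non-standard residues
            vec[KMER_INDEX[kmer]] = float(n)
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


class VectorIndex:
    """Exhaustive cosine-similarity search over stored embeddings.

    ASSUMPTION: a brute-force scan stands in for the accelerated index
    mentioned in the abstract; it returns exact neighbours but scales
    linearly with the number of stored proteins.
    """

    def __init__(self):
        self.ids = []
        self.vectors = []

    def add(self, protein_id, sequence):
        self.ids.append(protein_id)
        self.vectors.append(embed(sequence))

    def search(self, sequence, k=3):
        """Return the k stored proteins most similar to the query."""
        query = embed(sequence)
        # Vectors are unit length, so the dot product equals cosine similarity.
        scored = [
            (sum(q * v for q, v in zip(query, vec)), pid)
            for pid, vec in zip(self.ids, self.vectors)
        ]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [(pid, score) for score, pid in scored[:k]]


if __name__ == "__main__":
    index = VectorIndex()
    # Invented sequences, for illustration only.
    index.add("prot_a", "MKTAYIAKQR")
    index.add("prot_b", "GGGGSGGGGS")
    index.add("prot_c", "MKTAYIAKQL")
    for protein_id, score in index.search("MKTAYIAKQR", k=2):
        print(protein_id, round(score, 3))
```

Because ranking depends only on distance in the embedding space, swapping the toy `embed` for a PLM encoder would change which proteins count as neighbours without changing the retrieval logic. This is the property the framework relies on to surface proteins with low sequence similarity.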