The identification, classification and curation of tandem repeats in proteins is a complex process that requires manual work by experts, computing power and time. Recent advances in machine learning for protein structure prediction and repeat classification are useful for this process. However, no existing service brings together the databases and software needed to support research on repeat proteins. Here we present Daisy, an integrated web service for repeat protein curation. Daisy can process entries from the Protein Data Bank (PDB) and the AlphaFold Database to identify tandem repeats. In addition, it searches a query sequence against a library of Pfam hidden Markov models (HMMs). The identified families are then associated with repeat classifications through RepeatsDB. The predicted classification is used to guide the execution of the ReUPred algorithm and to speed up the identification of repeat units. The service can also process every PDB and AlphaFold structure associated with a UniProt proteome entry. Availability: The Daisy web service is freely accessible at daisy.bioinformatica.org.