Large language models (LLMs) such as ChatGPT have attracted considerable interest across diverse research communities. Their notable ability for text completion and generation has inaugurated a novel paradigm for language-interfaced problem solving. However, the potential and efficacy of these models in bioinformatics remain incompletely explored. In this work, we study the performance of LLMs on a wide spectrum of crucial bioinformatics tasks. These tasks include the identification of potential coding regions, extraction of named entities for genes and proteins, detection of antimicrobial and anti-cancer peptides, molecular optimization, and resolution of educational bioinformatics problems. Our findings indicate that, given appropriate prompts, LLMs like GPT variants can successfully handle most of these tasks. In addition, we provide a thorough analysis of their limitations in the context of complicated bioinformatics tasks. In conclusion, we believe that this work can provide new perspectives and motivate future research in the fields of LLM applications, AI for Science, and bioinformatics.