Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminology, specialized knowledge, and a high demand for precision, areas where general-purpose LLMs often fall short. In this study, we introduce PharmaGPT, a suite of domain-specialized LLMs with 13 billion and 70 billion parameters, trained on a comprehensive corpus tailored to the biopharmaceutical and chemical domains. Our evaluation shows that PharmaGPT surpasses existing general models on domain-specific benchmarks such as NAPLEX, demonstrating its capability in domain-specific tasks. Remarkably, this performance is achieved with models that have only a fraction, sometimes just one-tenth, of the parameters of general-purpose large models. This advancement establishes a new benchmark for LLMs in the biopharmaceutical and chemical fields, addressing the existing gap in specialized language modeling. It also suggests a promising path for enhanced research and development, paving the way for more precise and effective NLP applications in these areas.