Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminologies, specialized knowledge, and a high demand for precision, areas where general-purpose LLMs often fall short. In this study, we introduce PharmGPT, a suite of multilingual LLMs with 13 billion and 70 billion parameters, specifically trained on a comprehensive corpus of hundreds of billions of tokens tailored to the Bio-Pharmaceutical and Chemical sectors. Our evaluation shows that PharmGPT matches or surpasses existing general models on key benchmarks, such as NAPLEX, demonstrating its strong capability in domain-specific tasks. This advancement establishes a new benchmark for LLMs in the Bio-Pharmaceutical and Chemical fields, addressing the existing gap in specialized language modeling. Furthermore, these results suggest a promising path for enhanced research and development in these fields, paving the way for more precise and effective applications of NLP in specialized domains.