PharmaGPT: Domain-Specific Large Language Models for Bio-Pharmaceutical and Chemistry

Linqing Chen, Weilei Wang, Zilong Bai, Peng Xu, Yan Fang, Jie Fang, Wentao Wu, Lizhi Zhou, Ruiji Zhang, Yubin Xia, Chaobo Xu, Ran Hu, Licong Xu, Qijun Cai, Haoran Hua, Jing Sun, Jin Liu, Tian Qiu, Haowen Liu, Meng Hu, Xiuwen Li, Fei Gao, Yufu Wang, Lin Tie, Chaochao Wang, Jianping Lu, Cheng Sun, Yixin Wang, Shengjie Yang, Yuancheng Li, Lu Jin, Lisha Zhang, Fu Bian, Zhongkai Ye, Lidong Pei, Changyang Tu

Large language models (LLMs) have revolutionized Natural Language Processing (NLP) by minimizing the need for complex feature engineering. However, the application of LLMs in specialized domains like biopharmaceuticals and chemistry remains largely unexplored. These fields are characterized by intricate terminology, specialized knowledge, and a high demand for precision, areas where general-purpose LLMs often fall short. In this study, we introduce PharmaGPT, a suite of domain-specialized LLMs with 13 billion and 70 billion parameters, trained on a comprehensive corpus tailored to the bio-pharmaceutical and chemical domains. Our evaluation shows that PharmaGPT surpasses existing general models on domain-specific benchmarks such as NAPLEX, demonstrating its exceptional capability in domain-specific tasks. Remarkably, this performance is achieved with a model that has only a fraction, sometimes just one-tenth, of the parameters of general-purpose large models. This advancement establishes a new benchmark for LLMs in the bio-pharmaceutical and chemical fields, addressing the existing gap in specialized language modeling. It also suggests a promising path for enhanced research and development, paving the way for more precise and effective NLP applications in these areas.

