GeoGalactica: A Scientific Large Language Model in Geoscience

Zhouhan Lin, Cheng Deng, Le Zhou, Tianhang Zhang, Yi Xu, Yutong Xu, Zhongmou He, Yuanyuan Shi, Beiya Dai, Yunchong Song, Boyi Zeng, Qiyuan Chen, Tao Shi, Tianyu Huang, Yiwei Xu, Shu Wang, Luoyi Fu, Weinan Zhang, Junxian He, Chao Ma, Yunqiang Zhu, Xinbing Wang, Chenghu Zhou

Large language models (LLMs) have achieved great success thanks to their general knowledge and their ability to solve a wide spectrum of natural language processing (NLP) tasks. These abilities have opened the way to interdisciplinary applications that use artificial intelligence to foster scientific discovery in specific domains (AI for science, AI4S). Meanwhile, NLP techniques are used widely and in varied ways in geoscience research and practice, from knowledge extraction and document classification to question answering and knowledge discovery. In this work, we take an initial step toward leveraging LLMs for science through a rather straightforward approach. We specialize an LLM for geoscience by further pre-training it on a vast amount of geoscience text and then applying supervised fine-tuning (SFT) to the resulting model with our custom-collected instruction-tuning dataset. These efforts result in GeoGalactica, a model with 30 billion parameters. To the best of our knowledge, it is the largest language model for the geoscience domain. More specifically, GeoGalactica is obtained by further pre-training Galactica. We train GeoGalactica on a geoscience-related text corpus of 65 billion tokens, curated from extensive data sources in the big science project Deep-time Digital Earth (DDE); this corpus stands as the largest geoscience-specific text corpus. We then fine-tune the model with 1 million pairs of instruction-tuning data consisting of questions that demand professional geoscience knowledge to answer. In this technical report, we describe in detail all aspects of GeoGalactica, including data collection, data cleaning, base model selection, pre-training, SFT, and evaluation. We open-source our data curation tools and the checkpoints of GeoGalactica from the first 3/4 of pre-training.

