Large language models (LLMs) have achieved great success in general domains of natural language processing. In this paper, we bring LLMs to geoscience with the aim of advancing research and applications in the field. To this end, we present K2, the first LLM for geoscience, together with a suite of resources developed to promote further LLM research within geoscience. For instance, we curate GeoSignal, the first geoscience instruction-tuning dataset, which aims to align LLM responses with geoscience-related user queries. We also establish GeoBench, the first geoscience benchmark, to evaluate LLMs in the context of geoscience. In this work, we experiment with a complete recipe for adapting a pre-trained general-domain LLM to the geoscience domain. Specifically, we further train the LLaMA-7B model on a 5.5B-token geoscience text corpus, which includes over 1 million pieces of geoscience literature, and fine-tune the model on GeoSignal's supervised data. Moreover, we share a protocol that can efficiently gather domain-specific data and construct domain-supervised data, even when manpower is scarce. We also equip K2 with the ability to use tools, so that it can serve as a naive geoscience aide. Experiments on GeoBench demonstrate the effectiveness of our approach and datasets for geoscience knowledge understanding and utilization. We open-source all the training data and K2 model checkpoints at https://github.com/davendw49/k2.