This paper introduces ClimateGPT, a family of domain-specific large language models that synthesize interdisciplinary research on climate change. We trained two 7B models from scratch on a science-oriented dataset of 300B tokens. In the first model, 4.2B domain-specific tokens were included during pre-training; the second was adapted to the climate domain after pre-training. Additionally, ClimateGPT-7B, 13B and 70B are continuously pre-trained from Llama 2 on a domain-specific dataset of 4.2B tokens. Each model is instruction fine-tuned on a high-quality, human-generated, domain-specific dataset created in close cooperation with climate scientists. To reduce hallucinations, we optimize the model for retrieval augmentation and propose a hierarchical retrieval strategy. To make our model more accessible to non-English speakers, we propose using cascaded machine translation and show that this approach can perform comparably to natively multilingual models while being easier to scale to a large number of languages. To address the intrinsically interdisciplinary nature of climate change, we consider different research perspectives, so the model can produce in-depth answers focusing on individual perspectives in addition to an overall answer. We propose a suite of automatic climate-specific benchmarks to evaluate LLMs. On these benchmarks, ClimateGPT-7B performs on par with the ten times larger Llama-2-70B Chat model while not degrading results on general domain benchmarks. Our human evaluation confirms the trends observed in the automatic benchmarks. All models were trained and evaluated using renewable energy and are released publicly.