Large language models (LLMs), including both proprietary and open-source models, have showcased remarkable capabilities in addressing a wide range of downstream tasks. Nonetheless, when it comes to practical Chinese legal tasks, these models fail to meet the actual requirements. Proprietary models do not ensure data privacy for sensitive legal cases, while open-source models demonstrate unsatisfactory performance due to their lack of legal knowledge. To address this problem, we introduce LawGPT, the first open-source model specifically designed for Chinese legal applications. LawGPT comprises two key components: legal-oriented pre-training and legal supervised fine-tuning. Specifically, we employ large-scale Chinese legal documents for legal-oriented pre-training to incorporate legal domain knowledge. To further improve the model's performance on downstream legal tasks, we create a knowledge-driven instruction dataset for legal supervised fine-tuning. Our experimental results demonstrate that LawGPT outperforms the open-source LLaMA 7B model. Our code and resources are publicly available at https://github.com/pengxiao-song/LaWGPT and have received 5.7K stars on GitHub.