Applying recent advances in Natural Language Processing (NLP) to the legal sector poses challenging problems, such as extremely long sequence lengths, specialized vocabulary that is usually understood only by legal professionals, and severe data imbalance. The recent surge of Large Language Models (LLMs) has begun to provide new opportunities to apply NLP in the legal domain because of their ability to handle lengthy, complex sequences. Moreover, the emergence of domain-specific LLMs has shown extremely promising results on various tasks. In this study, we aim to quantify how general-purpose LLMs perform in comparison to legal-domain models (whether LLMs or otherwise). Specifically, we compare the zero-shot performance of three general-purpose LLMs (ChatGPT-20b, LLaMA-2-70b, and Falcon-180b) on the LEDGAR subset of the LexGLUE benchmark for contract provision classification. Although the LLMs were not explicitly trained on legal data, we observe that they are still able to classify the theme of a provision correctly in most cases. However, we find that their micro-F1/macro-F1 performance is up to 19.2/26.8% lower than that of smaller models fine-tuned on the legal domain, underscoring the need for more powerful legal-domain LLMs.
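Because the abstract reports both micro-F1 and macro-F1 on an imbalanced label set, a minimal sketch of how the two averages differ may help. It assumes the standard definitions for single-label multi-class classification; the function name `micro_macro_f1` and the toy labels are hypothetical and are not taken from the study or from LEDGAR.

```python
from collections import Counter


def micro_macro_f1(y_true, y_pred):
    """Return (micro_f1, macro_f1) for single-label multi-class predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    if not labels:
        return 0.0, 0.0

    # Per-class true positives, false positives, and false negatives.
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class gains a false positive
            fn[t] += 1  # true class gains a false negative

    def f1(tp_, fp_, fn_):
        denom = 2 * tp_ + fp_ + fn_
        return 2 * tp_ / denom if denom else 0.0

    # Micro: pool the counts over all classes, then compute one F1.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    # Macro: compute F1 per class, then take the unweighted mean.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    return micro, macro


# Made-up, imbalanced toy data (hypothetical labels, not LEDGAR).
y_true = ["common_clause"] * 8 + ["rare_clause"] * 2
y_pred = ["common_clause"] * 8 + ["common_clause", "rare_clause"]

micro, macro = micro_macro_f1(y_true, y_pred)
print(f"micro-F1 = {micro:.3f}, macro-F1 = {macro:.3f}")
```

In this toy case a single error on the rare class lowers macro-F1 more than micro-F1, since macro-F1 weights every class equally regardless of its frequency. This is one reason the two scores can diverge on imbalanced data.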