We find that the best publicly available LLMs, such as GPT-4, Claude, and PaLM 2, currently perform poorly at basic legal text handling. We introduce a benchmark consisting of tasks that lawyers and paralegals would expect LLMs to handle zero-shot, such as looking up the text at a line of a witness deposition or at a subsection of a contract. LLMs' poor performance on this benchmark casts doubt on their reliability as-is for legal practice. However, fine-tuning for these tasks brings even a smaller model to near-perfect performance on our test set and also raises performance on a related legal task. These results suggest that many simple behaviors needed for a domain may not be present in foundational LLMs without additional engagement from subject matter experts.