We find that the best publicly available LLMs, such as GPT-4 and PaLM 2, currently perform poorly at the basic text handling required of lawyers or paralegals, for example looking up the text at a given line of a witness deposition or in a subsection of a contract. We introduce a benchmark to quantify this poor performance, which casts doubt on LLMs' current reliability as-is for legal practice. Fine-tuning for these tasks brings an older LLM to near-perfect performance on our test set and also raises its performance on a related legal task. This stark result highlights the need for more domain expertise in LLM training.