Instruction tuning is an important step in making language models useful for direct user interaction. However, the legal domain is underrepresented in typical instruction datasets (e.g., only 10 out of 1600+ tasks in Super-NaturalInstructions). To study whether instruction tuning on legal datasets is necessary for strong legal reasoning, we aggregate 58 annotated legal datasets and write instructions for each, creating LawInstruct. LawInstruct covers 17 global jurisdictions, 24 languages, and a total of 12M examples across diverse tasks such as legal QA, summarization of court cases, and legal argument mining. We evaluate our models on LegalBench, measuring legal reasoning across five categories in 162 challenging and realistic legal tasks, and on MMLU, to measure potential drops in general reasoning capabilities. We find that legal-specific instruction tuning on Flan-T5 (yielding FLawN-T5) improves performance on LegalBench across all model sizes, with an aggregate increase of 15 points, or 50%, over Flan-T5 at the base size. No model size shows a performance drop on MMLU. We publish LawInstruct as a resource for further study of instruction tuning in the legal domain.