How Ready are Pre-trained Abstractive Models and LLMs for Legal Case Judgement Summarization?

From arXiv. Accepted at the 3rd Workshop on Artificial Intelligence and Intelligent Assistance for Legal Professionals in the Digital Workplace (LegalAIIA 2023), held in conjunction with the ICAIL 2023 conference.

Automatic summarization of legal case judgements has traditionally been attempted using extractive summarization methods. In recent years, however, abstractive summarization models have been gaining popularity since they can generate more natural and coherent summaries. Legal domain-specific pre-trained abstractive summarization models are now available. Moreover, general-domain pre-trained Large Language Models (LLMs), such as ChatGPT, are known to generate high-quality text and have the capacity for text summarization. Hence it is natural to ask whether these models are ready for off-the-shelf application to automatically generate abstractive summaries of case judgements. To explore this question, we apply several state-of-the-art domain-specific abstractive summarization models and general-domain LLMs to Indian court case judgements and check the quality of the generated summaries. In addition to standard metrics for summary quality, we check for inconsistencies and hallucinations in the summaries. We see that abstractive summarization models generally achieve slightly higher scores than extractive models on standard summary evaluation metrics such as ROUGE and BLEU. However, we often find inconsistent or hallucinated information in the generated abstractive summaries. Overall, our investigation indicates that pre-trained abstractive summarization models and LLMs are not yet ready for fully automatic deployment for case judgement summarization; rather, a human-in-the-loop approach that includes manual checks for inconsistencies is more suitable at present.
