How Ready are Pre-trained Abstractive Models and LLMs for Legal Case Judgement Summarization?

From arXiv. Accepted for presentation at the 3rd Workshop on Artificial Intelligence and Intelligent Assistance for Legal Professionals in the Digital Workplace (LegalAIIA 2023), co-located with the ICAIL 2023 conference.

Automatic summarization of legal case judgements has traditionally been attempted using extractive summarization methods. In recent years, however, abstractive summarization models have been gaining popularity, since they can generate more natural and coherent summaries. Legal domain-specific pre-trained abstractive summarization models are now available. Moreover, general-domain pre-trained Large Language Models (LLMs), such as ChatGPT, are known to generate high-quality text and are capable of text summarization. Hence it is natural to ask whether these models are ready for off-the-shelf application to automatically generate abstractive summaries of case judgements. To explore this question, we apply several state-of-the-art domain-specific abstractive summarization models and general-domain LLMs to Indian court case judgements, and check the quality of the generated summaries. In addition to standard metrics for summary quality, we check for inconsistencies and hallucinations in the summaries. We see that abstractive summarization models generally achieve slightly higher scores than extractive models on standard summary evaluation metrics such as ROUGE and BLEU. However, we often find inconsistent or hallucinated information in the generated abstractive summaries. Overall, our investigation indicates that pre-trained abstractive summarization models and LLMs are not yet ready for fully automatic deployment for case judgement summarization; rather, a human-in-the-loop approach that includes manual checks for inconsistencies is more suitable at present.
