Summarizing Indian legal court judgments is a complex task, not only because of the intricate language and unstructured nature of legal texts, but also because a large section of the Indian population does not understand the complex English in which these texts are written, creating a need for summaries in Indian languages. In this study, we aim to improve the summarization of Indian legal text by injecting domain knowledge into diverse summarization models, generating summaries in both English and Hindi (the most widely spoken Indian language). We propose a framework that enhances extractive neural summarization models by incorporating domain-specific pre-trained encoders tailored to legal texts. Further, we explore injecting legal domain knowledge into generative models (including Large Language Models) through continual pre-training on large legal corpora in English and Hindi. Our proposed approaches achieve statistically significant improvements in both English-to-English and English-to-Hindi Indian legal document summarization, as measured by standard evaluation metrics, factual consistency metrics, and legal domain-specific metrics. These improvements are further validated by domain experts, demonstrating the effectiveness of our approaches.