NLP in the legal domain has seen increasing success with the emergence of Transformer-based Pre-trained Language Models (PLMs) pre-trained on legal text. PLMs trained on European and US legal text are publicly available; however, legal text from other countries, such as India, has many distinguishing characteristics. With the rapidly increasing volume of legal NLP applications in various countries, it has become necessary to pre-train such LMs on the legal text of other countries as well. In this work, we investigate pre-training in the Indian legal domain. We re-train (continue pre-training) two popular legal PLMs, LegalBERT and CaseLawBERT, on Indian legal data, and we also train a model from scratch with a vocabulary based on Indian legal text. We apply these PLMs to three benchmark legal NLP tasks (Legal Statute Identification from facts, Semantic Segmentation of Court Judgment Documents, and Court Appeal Judgment Prediction) over both Indian and non-Indian (EU, UK) datasets. We observe that our approach improves performance not only on the new domain (Indian texts) but also on the original domain (European and UK texts). We also conduct explainability experiments for a qualitative comparison of these PLMs.