NLP in the legal domain has seen increasing success with the emergence of Transformer-based Pre-trained Language Models (PLMs) pre-trained on legal text. PLMs trained on European and US legal text are publicly available; however, legal text from other domains (countries), such as India, has many distinguishing characteristics. With the rapidly increasing volume of Legal NLP applications in various countries, it has become necessary to pre-train such LMs on the legal text of other countries as well. In this work, we investigate pre-training in the Indian legal domain. We re-train (continue pre-training) two popular legal PLMs, LegalBERT and CaseLawBERT, on Indian legal data, and also train a model from scratch with a vocabulary based on Indian legal text. We apply these PLMs to three benchmark legal NLP tasks -- Legal Statute Identification from facts, Semantic Segmentation of Court Judgment Documents, and Court Appeal Judgment Prediction -- on both Indian and non-Indian (EU, UK) datasets. We observe that our approach improves performance not only on the new domain (Indian texts) but also on the original domain (European and UK texts). We also conduct explainability experiments for a qualitative comparison of all these PLMs.