Many populous countries, including India, are burdened with a considerable backlog of legal cases. Developing automated systems that can process legal documents and assist legal practitioners could help mitigate this backlog. However, there is a dearth of the high-quality corpora needed to develop such data-driven systems. The problem is even more pronounced for low-resource languages such as Hindi. In this resource paper, we introduce the Hindi Legal Documents Corpus (HLDC), a corpus of more than 900K legal documents in Hindi. The documents are cleaned and structured to enable the development of downstream applications. Further, as a use case for the corpus, we introduce the task of bail prediction. We experiment with a battery of models and propose a Multi-Task Learning (MTL) based model for this task. The MTL model uses summarization as an auxiliary task alongside bail prediction as the main task. Experiments with the different models indicate the need for further research in this area. We release the corpus and model implementation code with this paper: https://github.com/Exploration-Lab/HLDC