We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. It consists of over 250,000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases as old as the 16th century. This paper presents the first release of the corpus, containing the raw text and metadata. Together with the corpus, we provide annotations of case outcomes for 638 cases, made by legal experts. Using our annotated data, we have trained and evaluated GPT-3, GPT-4 and RoBERTa models on case outcome extraction to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.