We introduce the Cambridge Law Corpus (CLC), a corpus for legal AI research. It consists of over 250 000 court cases from the UK. Most cases are from the 21st century, but the corpus includes cases dating back to the 16th century. This paper presents the first release of the corpus, containing the raw text and metadata. Together with the corpus, we provide case outcome annotations for 638 cases, produced by legal experts. Using the annotated data, we trained and evaluated GPT-3, GPT-4 and RoBERTa models on case outcome extraction to provide benchmarks. We include an extensive legal and ethical discussion to address the potentially sensitive nature of this material. As a consequence, the corpus will only be released for research purposes under certain restrictions.