Segmentation and Processing of German Court Decisions from Open Legal Data

Source: arXiv. Accepted and published as a research article in Legal Knowledge and Information Systems (JURIX 2025 proceedings, IOS Press), pages 276-281.

The availability of structured legal data is important for advancing Natural Language Processing (NLP) techniques for the German legal system. One of the most widely used datasets, Open Legal Data, provides a large-scale collection of German court decisions. While the metadata in this raw dataset is consistently structured, the decision texts themselves are inconsistently formatted and often lack clearly marked sections. Reliable separation of these sections is important not only for rhetorical role classification but also for downstream tasks such as retrieval and citation analysis. In this work, we introduce a cleaned and sectioned dataset of 251,038 German court decisions derived from the official Open Legal Data dataset. We systematically separated three important sections in German court decisions, namely Tenor (operative part of the decision), Tatbestand (facts of the case), and Entscheidungsgründe (judicial reasoning), which are often inconsistently represented in the original dataset. To ensure the reliability of our extraction process, we used Cochran's formula with a 95% confidence level and a 5% margin of error to draw a statistically representative random sample of 384 cases, and manually verified that all three sections were correctly identified. We also extracted the Rechtsmittelbelehrung (appeal notice) as a separate field, since it is a procedural instruction and not part of the decision itself. The resulting corpus is publicly available in the JSONL format, making it an accessible resource for further research on the German legal system.
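
The sample size of 384 quoted above can be reproduced with a short calculation. The sketch below is illustrative only and is not taken from the paper: it assumes the most conservative proportion p = 0.5, which the abstract does not state, and the function names are hypothetical. It also applies a finite population correction for the 251,038 decisions as an optional step; whether the authors applied that correction is not specified.

```python
import math
from statistics import NormalDist


def cochran_sample_size(confidence=0.95, margin=0.05, p=0.5):
    """Cochran's formula for a large population: n0 = z^2 * p * (1 - p) / e^2.

    p = 0.5 is an assumption (it maximizes p * (1 - p), giving the largest n0).
    """
    # Two-sided critical value, about 1.96 for a 95% confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z**2 * p * (1 - p) / margin**2


def finite_population_correction(n0, population):
    """Optional adjustment for a finite population of known size."""
    return n0 / (1 + (n0 - 1) / population)


n0 = cochran_sample_size()
n_corrected = finite_population_correction(n0, 251_038)

print(f"uncorrected n0: {n0:.2f}")
print(f"corrected for N = 251,038: {n_corrected:.2f}")
print(f"rounded up after correction: {math.ceil(n_corrected)}")
```

Under these assumptions the uncorrected value falls just above 384, and the corrected value rounds up to 384. Because the corpus is so much larger than the sample, the correction changes the result by less than one case.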
