In legal NLP, Case Outcome Classification (COC) must be not only accurate but also trustworthy and explainable. Existing work on explainable COC has been limited to annotations by a single expert. However, it is well known that lawyers may disagree in their assessment of case facts. We therefore collect a novel dataset, RAVE (Rationale Variation in ECHR), annotated by two experts in international human rights law, between whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in legal NLP that focuses on human label variation. We quantitatively assess the different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainability of state-of-the-art (SOTA) COC models on RAVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case's facts supposedly relevant to its outcome.