In legal NLP, Case Outcome Classification (COC) must be not only accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well known that lawyers may disagree in their assessment of case facts. We therefore collect a novel dataset, RAVE (Rationale Variation in ECHR), obtained from two experts in international human rights law, between whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in legal NLP that focuses on human label variation. We quantitatively assess the different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainability of SOTA COC models on RAVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case's facts supposedly relevant to its outcome.