In legal NLP, Case Outcome Classification (COC) must not only be accurate but also trustworthy and explainable. Existing work in explainable COC has been limited to annotations by a single expert. However, it is well known that lawyers may disagree in their assessment of case facts. We hence collect a novel dataset, RAVE (Rationale Variation in ECHR), which is obtained from two experts in the domain of international human rights law, for whom we observe weak agreement. We study their disagreements and build a two-level task-independent taxonomy, supplemented with COC-specific subcategories. To our knowledge, this is the first work in legal NLP that focuses on human label variation. We quantitatively assess different taxonomy categories and find that disagreements mainly stem from underspecification of the legal context, which poses challenges given the typically limited granularity and noise in COC metadata. We further assess the explainability of SOTA COC models on RAVE and observe limited agreement between models and experts. Overall, our case study reveals hitherto underappreciated complexities in creating benchmark datasets in legal NLP that revolve around identifying aspects of a case's facts supposedly relevant to its outcome.