In legal decisions, split votes (SV) occur when judges cannot reach a unanimous decision, which poses difficulties for lawyers who must navigate diverse legal arguments and opinions. In high-stakes domains, understanding how well perceived difficulty aligns between humans and AI systems is crucial for building trust. However, existing NLP calibration methods focus on a classifier's awareness of its own predictive performance, measured against the human majority class, and overlook inherent human label variation (HLV). This paper explores split votes as naturally observable human disagreement and value pluralism. We collect judges' vote distributions from the European Court of Human Rights (ECHR) and present SV-ECHR, a case outcome classification (COC) dataset with SV information. We build a taxonomy of disagreement with SV-specific subcategories. We further assess the alignment of perceived difficulty between models and humans, as well as the confidence calibration and human calibration of COC models. We observe limited alignment with the judge vote distribution. To our knowledge, this is the first systematic exploration of calibration to human judgements in legal NLP. Our study underscores the need for further research on measuring and enhancing model calibration that accounts for HLV in legal decision tasks.
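The contrast the abstract draws between confidence calibration (against the majority label) and human calibration (against the judges' vote distribution) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the binary violation/no-violation framing, the use of expected calibration error (ECE) for confidence calibration, and the mean absolute gap (equal to total variation distance in the binary case) for human calibration. It is not presented as the metrics or implementation used in the paper.

```python
from typing import Sequence


def expected_calibration_error(
    confidences: Sequence[float],
    correct: Sequence[bool],
    n_bins: int = 10,
) -> float:
    """ECE: bin-weighted gap between mean confidence and accuracy.

    `correct` is judged against the majority label, so split votes
    are invisible to this metric.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes to last bin
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / n * abs(avg_conf - accuracy)
    return ece


def human_calibration_distance(
    model_probs: Sequence[float],
    votes_for: Sequence[int],
    n_judges: Sequence[int],
) -> float:
    """Mean absolute gap between the model's P(violation) and the
    fraction of judges who voted for a violation (hypothetical measure)."""
    gaps = [abs(p - v / n) for p, v, n in zip(model_probs, votes_for, n_judges)]
    return sum(gaps) / len(gaps)


# Toy data (made up): three cases decided by a seven-judge chamber.
model_probs = [0.9, 0.6, 0.2]   # model's probability of "violation"
votes_for = [7, 4, 0]           # judges voting "violation"
n_judges = [7, 7, 7]

human_gap = human_calibration_distance(model_probs, votes_for, n_judges)
ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [True, True, True, False])
```

In this toy setup, the second case is a 4-3 split: a model that outputs 0.6 is close to the vote distribution even though a majority-label metric would only register whether its top prediction matched the outcome.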