Comparison of Unsupervised Metrics for Evaluating Judicial Decision Extraction

The rapid advancement of artificial intelligence in legal natural language processing demands scalable methods for evaluating text extraction from judicial decisions. This study evaluates 16 unsupervised metrics, including novel formulations, to assess the quality of extracting seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1-5 Likert scale. These metrics span document-based, semantic, structural, pseudo-ground-truth, and legal-specific categories, and operate without pre-annotated ground truth. Bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson $r = 0.540$, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson $r = 0.513$, Lin CCC = 0.443, MAE = 0.139) align best with expert ratings, while Legal Term Density (Pearson $r = -0.479$, Lin CCC = -0.079, MAE = 0.394) shows a strong negative correlation. The LLM Evaluation Score (mean = 0.849, Pearson $r = 0.382$, Lin CCC = 0.325, MAE = 0.197) aligns only moderately with expert ratings; its performance, obtained with gpt-4.1-mini via g4f, suggests limited specialization for legal texts. These findings indicate that unsupervised metrics, including LLM-based approaches, enable scalable screening but, given their moderate correlations and low CCC values, cannot fully replace human judgment in high-stakes legal contexts. This work advances legal NLP by providing annotation-free evaluation tools, with implications for judicial analytics and ethical AI deployment.
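The three agreement statistics named in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the function names, the linear rescaling of 1-5 Likert ratings to $[0, 1]$, the use of population (biased) variances in Lin's CCC, and the percentile bootstrap are all assumptions made for illustration, and the data are synthetic.

```python
import numpy as np


def normalize_likert(ratings, low=1, high=5):
    """Map Likert ratings linearly onto [0, 1] (assumed rescaling)."""
    r = np.asarray(ratings, dtype=float)
    return (r - low) / (high - low)


def pearson_r(x, y):
    """Pearson correlation: linear association, blind to shifts in scale or location."""
    return float(np.corrcoef(x, y)[0, 1])


def lin_ccc(x, y):
    """Lin's concordance correlation coefficient with population variances.

    Unlike Pearson r, it penalizes a difference in means or variances,
    so it measures agreement with the identity line y = x.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    mx, my = x.mean(), y.mean()
    cov = ((x - mx) * (y - my)).mean()
    return float(2 * cov / (x.var() + y.var() + (mx - my) ** 2))


def mae(x, y):
    """Mean absolute error between two score vectors on the same scale."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.mean(np.abs(x - y)))


def bootstrap_ci(stat, x, y, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for a paired statistic stat(x, y)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = len(x)
    samples = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample pairs with replacement
        samples.append(stat(x[idx], y[idx]))
    lo, hi = np.quantile(samples, [alpha / 2, 1 - alpha / 2])
    return float(lo), float(hi)


# Synthetic example: a noisy metric compared with made-up expert ratings.
rng = np.random.default_rng(42)
expert = normalize_likert(rng.integers(1, 6, size=200))
metric = np.clip(expert + rng.normal(0.0, 0.3, size=200), 0.0, 1.0)

print("Pearson r:", round(pearson_r(metric, expert), 3))
print("Lin CCC:  ", round(lin_ccc(metric, expert), 3))
print("MAE:      ", round(mae(metric, expert), 3))
print("95% CI r: ", bootstrap_ci(pearson_r, metric, expert))
```

Reporting CCC alongside Pearson $r$ is useful because a metric can track expert ratings linearly yet sit on a shifted or compressed scale; in that case $r$ stays high while CCC and MAE expose the disagreement.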

