The rapid advancement of artificial intelligence in legal natural language processing demands scalable methods for evaluating text extraction from judicial decisions. This study evaluates 16 unsupervised metrics, including novel formulations, to assess the quality of extracting seven semantic blocks from 1,000 anonymized Russian judicial decisions, validated against 7,168 expert reviews on a 1--5 Likert scale. These metrics, spanning document-based, semantic, structural, pseudo-ground-truth, and legal-specific categories, operate without pre-annotated ground truth. Bootstrapped correlations, Lin's concordance correlation coefficient (CCC), and mean absolute error (MAE) reveal that Term Frequency Coherence (Pearson $r = 0.540$, Lin CCC = 0.512, MAE = 0.127) and Coverage Ratio/Block Completeness (Pearson $r = 0.513$, Lin CCC = 0.443, MAE = 0.139) align best with expert ratings, while Legal Term Density (Pearson $r = -0.479$, Lin CCC = -0.079, MAE = 0.394) shows a strong negative correlation. The LLM Evaluation Score (mean = 0.849, Pearson $r = 0.382$, Lin CCC = 0.325, MAE = 0.197) shows moderate alignment, but its performance, obtained with gpt-4.1-mini via g4f, suggests limited specialization for legal texts. These findings highlight that unsupervised metrics, including LLM-based approaches, enable scalable screening but, given their moderate correlations and low CCC values, cannot fully replace human judgment in high-stakes legal contexts. This work advances legal NLP by providing annotation-free evaluation tools, with implications for judicial analytics and ethical AI deployment.