In legal practice, judges apply the trichotomous dogmatics of criminal law, sequentially assessing the elements of the offense, unlawfulness, and culpability to determine whether an individual's conduct constitutes a crime. Although current legal large language models (LLMs) show promising accuracy in judgment prediction, they lack trichotomous reasoning capabilities due to the absence of an appropriate benchmark dataset, preventing them from predicting innocent outcomes. As a result, every input is automatically assigned a charge, limiting their practical utility in legal contexts. To bridge this gap, we introduce LJPIV, the first benchmark dataset for Legal Judgment Prediction with Innocent Verdicts. Adhering to the trichotomous dogmatics, we extend three widely-used legal datasets through LLM-based augmentation and manual verification. Our experiments with state-of-the-art legal LLMs and novel strategies that integrate trichotomous reasoning into zero-shot prompting and fine-tuning reveal: (1) current legal LLMs have significant room for improvement, with even the best models achieving an F1 score of less than 0.3 on LJPIV; and (2) our strategies notably enhance both in-domain and cross-domain judgment prediction accuracy, especially for cases resulting in an innocent verdict.