Legal text entailment aims to determine whether the assertions in a legal query logically follow from the information provided in one or more legal articles. ChatGPT, a large language model, performs strongly on many natural language processing tasks, including legal text entailment: with the temperature set to 0 (so that its answers are deterministic), it achieves 70.64% accuracy on the COLIEE 2022 dataset, outperforming the previous state of the art (SOTA) of 67.89%. When the temperature is greater than zero, however, ChatGPT's answers are no longer deterministic, which leads to inconsistent answers and fluctuating results. We propose to use label models, a fundamental component of weak supervision techniques, to integrate ChatGPT's provisional answers into consolidated labels. In this way, we treat the provisional answers as noisy predictions that a label model can consolidate. Experimental results show that this approach attains an accuracy of 76.15%, a significant improvement of 8.26 percentage points over the prior state of the art. We also analyze the instances in which ChatGPT produces incorrect answers and classify the errors, offering insights that could guide future research.
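To make the consolidation step concrete, the following is a minimal sketch of one possible label model. The abstract does not say which label model is used, so this example substitutes a simple one-coin Dawid-Skene estimator fitted with EM. The function names (`fit_label_model`, `consolidate`), the encoding (1 for "entailed", 0 for "not entailed"), and the toy votes are all illustrative assumptions; none of them comes from the paper or from COLIEE data.

```python
import math
from typing import List


def fit_label_model(votes: List[List[int]], n_iter: int = 20,
                    eps: float = 1e-3) -> List[float]:
    """One-coin Dawid-Skene EM: return P(label = 1) for each query.

    votes[i][j] is the 0/1 answer of sampled run j on query i
    (hypothetical encoding: 1 = entailed, 0 = not entailed).
    """
    n_items, n_runs = len(votes), len(votes[0])
    # Initialise the posterior with the fraction of runs voting 1.
    post = [sum(row) / n_runs for row in votes]
    for _ in range(n_iter):
        # M-step: a run's accuracy is its expected agreement with the posterior.
        acc = []
        for j in range(n_runs):
            agree = sum(p if row[j] == 1 else 1.0 - p
                        for row, p in zip(votes, post))
            # Clip so the log-odds below stay finite.
            acc.append(min(max(agree / n_items, eps), 1.0 - eps))
        # E-step: weighted vote in log-odds space, assuming a uniform class prior.
        weights = [math.log(a / (1.0 - a)) for a in acc]
        post = []
        for row in votes:
            logit = sum(w if v == 1 else -w for w, v in zip(weights, row))
            post.append(1.0 / (1.0 + math.exp(-logit)))
    return post


def consolidate(votes: List[List[int]], threshold: float = 0.5) -> List[int]:
    """Turn the posteriors into hard consolidated labels."""
    return [1 if p >= threshold else 0 for p in fit_label_model(votes)]


# Invented toy data: 5 queries, each answered by 3 runs sampled at temperature > 0.
toy_votes = [
    [1, 1, 1],
    [1, 1, 0],
    [0, 0, 0],
    [0, 0, 1],
    [1, 1, 0],
]
labels = consolidate(toy_votes)
print(labels)
```

In this sketch, runs that tend to agree with the emerging consensus receive larger weights, so a run that frequently disagrees has less influence than it would under plain majority voting. A production setup would more likely rely on an established weak supervision library than on a hand-written estimator.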