In this study, we focus on two tasks: detecting legal violations in unstructured text, and associating these violations with potentially affected individuals. We constructed two datasets using Large Language Models (LLMs), which were subsequently validated by domain expert annotators. Both tasks were designed specifically for the context of class-action cases. The experimental design involved fine-tuning models from the BERT family and open-source LLMs, and conducting few-shot experiments with closed-source LLMs. Our results, with F1-scores of 62.69\% (violation identification) and 81.02\% (associating victims), show that our datasets and setups can be used for both tasks. Finally, we publicly release the datasets and the code used for the experiments to advance further research in legal natural language processing (NLP).