In this study, we analyze data-scarce classification scenarios in which the available labeled legal data is small and imbalanced, which can degrade the quality of the results. We focus on two finetuning setups for a legal provision classification task: SetFit (Sentence Transformer Finetuning), which uses a contrastive learning objective, and vanilla finetuning. Additionally, we compare the features extracted with LIME (Local Interpretable Model-agnostic Explanations) to see which features contribute to each model's classification decisions. The results show that the contrastive setup with SetFit outperforms vanilla finetuning while using only a fraction of the training samples. The LIME results show that the contrastive learning approach boosts both positive and negative features that are legally informative and contribute to the classification results. Thus, a model finetuned with a contrastive objective seems to base its decisions more confidently on legally informative features.