Legal case retrieval is a critical process for modern legal information systems. Recent studies have built legal case retrieval models on pre-trained language models (PLMs) that follow the general-domain self-supervised pre-training paradigm, but using general-domain PLMs as backbones has limitations. Specifically, these models may not fully capture the underlying legal features of legal case documents. To address this issue, we propose CaseEncoder, a legal document encoder that leverages fine-grained legal knowledge in both the data sampling and pre-training phases. In the data sampling phase, we improve the quality of the training data by using fine-grained law article information to guide the selection of positive and negative examples. In the pre-training phase, we design legal-specific pre-training tasks that align with the criteria for judging legal case relevance. Based on these tasks, we introduce a new loss function, Biased Circle Loss, to enhance the model's ability to recognize case relevance at a fine-grained level. Experimental results on multiple benchmarks demonstrate that CaseEncoder significantly outperforms both existing general pre-training models and legal-specific pre-training models in zero-shot legal case retrieval.
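The abstract names Biased Circle Loss without defining it. For orientation only, the sketch below implements the standard Circle Loss for similarity learning, which the name suggests as a starting point; this connection is an assumption, and the bias term that CaseEncoder adds is not specified here and is therefore not modeled. The function name `circle_loss`, the helper `_logsumexp`, and the defaults `m=0.25` and `gamma=64.0` are illustrative choices, not values taken from this work.

```python
import numpy as np


def _logsumexp(x):
    """Numerically stable log(sum(exp(x))) for a 1-D array."""
    peak = np.max(x)
    return float(peak + np.log(np.sum(np.exp(x - peak))))


def circle_loss(sp, sn, m=0.25, gamma=64.0):
    """Standard Circle Loss for one query (illustrative sketch).

    sp: similarities between the query and its positive cases
    sn: similarities between the query and its negative cases
    m:  relaxation margin (hypothetical default)
    gamma: scale factor (hypothetical default)
    """
    sp = np.asarray(sp, dtype=float)
    sn = np.asarray(sn, dtype=float)

    # Adaptive weights: pairs far from their optimum get larger weights.
    alpha_p = np.clip(1.0 + m - sp, 0.0, None)  # optimum for positives is 1 + m
    alpha_n = np.clip(sn + m, 0.0, None)        # optimum for negatives is -m

    # Weighted, margin-shifted logits (margins are 1 - m and m).
    logit_p = -gamma * alpha_p * (sp - (1.0 - m))
    logit_n = gamma * alpha_n * (sn - m)

    # log(1 + sum_j exp(logit_n_j) * sum_i exp(logit_p_i)), computed stably.
    z = _logsumexp(logit_n) + _logsumexp(logit_p)
    return float(np.logaddexp(0.0, z))


# Well-separated similarities should give a much smaller loss
# than similarities where negatives outrank positives.
good = circle_loss(sp=[0.9, 0.8], sn=[0.1, 0.2])
bad = circle_loss(sp=[0.3], sn=[0.7])
```

In this form every positive and every negative is weighted only by its distance from a fixed optimum. A variant that distinguishes degrees of case relevance would need an additional per-pair term, which is the part this sketch leaves out.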