This study investigates the challenges posed by the dynamic nature of legal multi-label text classification, where legal concepts evolve over time. Existing models often overlook the temporal dimension during training: they treat the training data as a single homogeneous block, which leads to suboptimal performance over time. To address this, we introduce ChronosLex, an incremental training paradigm that trains models on chronological splits, preserving the temporal order of the data. However, this incremental approach raises concerns about overfitting to recent data, so we assess mitigation strategies based on continual learning and temporally invariant methods. Our experiments on six legal multi-label text classification datasets show that continual learning methods are effective in preventing overfitting, thereby enhancing temporal generalizability, while temporally invariant methods struggle to capture the dynamics of temporal shift.
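To make the idea of training on chronological splits concrete, the following is a minimal sketch rather than the ChronosLex implementation. The abstract does not say how splits are built or how the model is updated, so everything here is an assumption for illustration: the `(year, text, labels)` record layout, the equal-sized contiguous splits, the helper names `chronological_splits` and `train_incrementally`, and the toy `count_labels` update that stands in for a real fine-tuning step. It does not include any of the continual learning or temporally invariant mitigation methods.

```python
from typing import Callable, Dict, List, Sequence, Tuple

# Hypothetical record layout: (year, text, labels). Not taken from the paper.
Doc = Tuple[int, str, List[str]]
State = Dict[str, int]


def chronological_splits(docs: Sequence[Doc], n_splits: int) -> List[List[Doc]]:
    """Sort documents by year and cut them into n_splits contiguous chunks."""
    if n_splits < 1:
        raise ValueError("n_splits must be at least 1")
    ordered = sorted(docs, key=lambda d: d[0])
    base, extra = divmod(len(ordered), n_splits)
    splits: List[List[Doc]] = []
    start = 0
    for i in range(n_splits):
        # The first `extra` splits take one additional document each.
        size = base + (1 if i < extra else 0)
        splits.append(ordered[start:start + size])
        start += size
    return splits


def train_incrementally(
    state: State,
    splits: Sequence[List[Doc]],
    update: Callable[[State, List[Doc]], State],
) -> State:
    """Apply `update` to each split in temporal order, oldest first."""
    for split in splits:
        state = update(state, split)
    return state


def count_labels(state: State, split: List[Doc]) -> State:
    """Toy stand-in for a training step: accumulate label frequencies."""
    new_state = dict(state)
    for _, _, labels in split:
        for label in labels:
            new_state[label] = new_state.get(label, 0) + 1
    return new_state


docs: List[Doc] = [
    (2012, "doc c", ["tax"]),
    (2005, "doc a", ["trade"]),
    (2019, "doc e", ["privacy", "tax"]),
    (2008, "doc b", ["trade", "tax"]),
    (2015, "doc d", ["privacy"]),
]

splits = chronological_splits(docs, n_splits=3)
final_state = train_incrementally({}, splits, count_labels)
```

In a real setting, `update` would continue training the same model on each split in turn. The point of the sketch is only that the model sees older data before newer data, instead of one shuffled block.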