Segmentation and Rhetorical Role Labeling of legal judgements play a crucial role in retrieval and adjacent tasks, including case summarization, semantic search, and argument mining. Previous approaches have formulated this task either as independent classification or as sequence labeling of sentences. In this work, we reformulate the task at the span level: identifying spans of multiple consecutive sentences that share the same rhetorical role and assigning each span its label via classification. We employ semi-Markov Conditional Random Fields (CRFs) to jointly learn span segmentation and span label assignment. We further explore three data augmentation strategies to mitigate data scarcity in the specialized legal domain, where individual documents tend to be very long and annotation costs are high. Our experiments demonstrate improved span-level prediction metrics with a semi-Markov CRF model over a CRF baseline. This benefit is contingent on the presence of multi-sentence spans in the document.
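To make the span-level formulation concrete, the following is a minimal sketch of two pieces it implies: collapsing sentence-level labels into spans, and semi-Markov Viterbi decoding, which searches jointly over span boundaries and span labels. It is an illustration under stated assumptions, not the authors' implementation. The names `labels_to_spans`, `semi_markov_viterbi`, `seg_score`, `trans`, and `max_len` are hypothetical, the span scorer is assumed to be supplied by some trained model, and training (which would need the forward algorithm for the partition function) is omitted.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Span = Tuple[int, int, int]  # (start, end_exclusive, label)


def labels_to_spans(labels: Sequence[int]) -> List[Span]:
    """Merge runs of identical consecutive sentence labels into spans."""
    spans: List[Span] = []
    start = 0
    for i in range(1, len(labels) + 1):
        if i == len(labels) or labels[i] != labels[start]:
            spans.append((start, i, labels[start]))
            start = i
    return spans


def semi_markov_viterbi(
    n: int,
    num_labels: int,
    max_len: int,
    seg_score: Callable[[int, int, int], float],  # assumed span scorer
    trans: Sequence[Sequence[float]],  # trans[prev_label][label]
) -> List[Span]:
    """Return the highest-scoring segmentation of n sentences into labeled spans.

    best[i][y] holds the best score of any segmentation of sentences [0, i)
    whose final span carries label y. Spans are at most max_len sentences long.
    Adjacent spans with the same label are not forbidden in this sketch.
    """
    if n == 0:
        return []
    neg_inf = float("-inf")
    best = [[neg_inf] * num_labels for _ in range(n + 1)]
    back: List[List[Optional[Tuple[int, Optional[int]]]]] = [
        [None] * num_labels for _ in range(n + 1)
    ]
    for i in range(1, n + 1):
        for y in range(num_labels):
            for d in range(1, min(max_len, i) + 1):
                s = i - d  # candidate span covers sentences [s, i)
                emit = seg_score(s, i, y)
                if s == 0:
                    # First span of the document: no transition term.
                    if emit > best[i][y]:
                        best[i][y] = emit
                        back[i][y] = (0, None)
                    continue
                for y_prev in range(num_labels):
                    if best[s][y_prev] == neg_inf:
                        continue
                    cand = best[s][y_prev] + trans[y_prev][y] + emit
                    if cand > best[i][y]:
                        best[i][y] = cand
                        back[i][y] = (s, y_prev)
    # Follow back-pointers from the best final label to recover the spans.
    y: Optional[int] = max(range(num_labels), key=lambda lab: best[n][lab])
    i = n
    spans: List[Span] = []
    while i > 0:
        s, y_prev = back[i][y]
        spans.append((s, i, y))
        i, y = s, y_prev
    spans.reverse()
    return spans
```

With `max_len = 1` every span is a single sentence and the search reduces to ordinary linear-chain Viterbi decoding, which is one way to see why a span-level model can only differ from a CRF baseline on documents that actually contain multi-sentence spans.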