We present DALE, a novel and effective generative Data Augmentation framework for low-resource LEgal NLP. DALE addresses the difficulty existing frameworks have in generating effective augmentations of legal documents: legal language, with its specialized vocabulary and complex semantics, morphology, and syntax, does not benefit from augmentations that merely rephrase the source sentence. To address this, DALE, built on an encoder-decoder language model, is pre-trained with a novel unsupervised text denoising objective based on selective masking. Our masking strategy exploits the domain-specific language characteristics of templatized legal documents to mask collocated spans of text. Denoising these spans helps DALE acquire knowledge about legal concepts, principles, and language usage, and consequently the ability to generate coherent and diverse augmentations with novel contexts. Finally, DALE uses conditional generation to produce synthetic augmentations for low-resource legal NLP tasks. We demonstrate the effectiveness of DALE on 13 datasets spanning 6 tasks and 4 low-resource settings. DALE outperforms all our baselines, including LLMs, both qualitatively and quantitatively, with improvements of 1%-50%.
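To make the idea of masking collocated spans concrete, the following is a minimal sketch and not the authors' implementation. It assumes pointwise mutual information (PMI) over adjacent word pairs as the collocation score, which the abstract does not specify. The names `bigram_pmi`, `mask_collocations`, `min_count`, `threshold`, and the `<mask>` token are all hypothetical choices made for illustration.

```python
import math
from collections import Counter

MASK = "<mask>"  # hypothetical mask token; the real one depends on the model


def bigram_pmi(corpus, min_count=2):
    """Score adjacent word pairs by PMI (an assumed collocation measure).

    corpus is a list of token lists. Pairs seen fewer than min_count times
    are skipped, because PMI is unreliable for rare pairs.
    """
    unigrams = Counter()
    bigrams = Counter()
    for tokens in corpus:
        unigrams.update(tokens)
        bigrams.update(zip(tokens, tokens[1:]))
    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for (a, b), count in bigrams.items():
        if count < min_count:
            continue
        p_ab = count / n_bi
        p_a = unigrams[a] / n_uni
        p_b = unigrams[b] / n_uni
        scores[(a, b)] = math.log(p_ab / (p_a * p_b))
    return scores


def mask_collocations(tokens, scores, threshold=0.0):
    """Replace each maximal run of high-scoring adjacent tokens with one MASK."""
    # flagged[i] is True when token i is part of a pair scoring >= threshold
    flagged = [False] * len(tokens)
    for i, pair in enumerate(zip(tokens, tokens[1:])):
        if scores.get(pair, float("-inf")) >= threshold:
            flagged[i] = flagged[i + 1] = True
    masked = []
    for token, is_flagged in zip(tokens, flagged):
        if not is_flagged:
            masked.append(token)
        elif not masked or masked[-1] != MASK:
            masked.append(MASK)  # collapse the whole span into a single mask
    return masked
```

Under these assumptions, a denoising training pair would consist of the output of `mask_collocations` as the encoder input and the original token sequence as the decoder target. How spans are actually scored and selected in DALE is not stated in the abstract, so the scoring function here should be read as a stand-in.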