Sentence Boundary Detection (SBD) is one of the foundational building blocks of Natural Language Processing (NLP), since incorrectly split sentences heavily affect the output quality of downstream tasks. It is a challenging task for algorithms, especially in the legal domain, given the complex and varied sentence structures used there. In this work, we curated a diverse multilingual legal dataset consisting of over 130,000 annotated sentences in six languages. Our experimental results indicate that existing SBD models perform poorly on multilingual legal data. We trained and tested monolingual and multilingual models based on CRF, BiLSTM-CRF, and transformers, and these models achieve state-of-the-art performance. We also show that our multilingual models outperform all baselines in the zero-shot setting on a Portuguese test set. To encourage further research and development by the community, we have made our dataset, models, and code publicly available.
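To make the difficulty concrete, the following minimal sketch shows how a simple rule-based splitter can break on an abbreviation common in legal citations. It is an illustration only and is not one of the models or baselines described above; the function `naive_split`, its splitting rule, and the example text are all hypothetical.

```python
import re


def naive_split(text: str) -> list[str]:
    """Hypothetical rule: split at whitespace that follows '.', '!' or '?'
    and precedes an uppercase letter."""
    return [s for s in re.split(r"(?<=[.!?])\s+(?=[A-Z])", text) if s]


# Illustrative text written for this sketch (two sentences).
text = "The court followed Smith v. Jones. The appeal was dismissed."

sentences = naive_split(text)
for sentence in sentences:
    print(repr(sentence))
```

The text contains two sentences, but the rule also splits after the abbreviation "v." in the case name, so it returns three fragments. Abbreviations, citations, and enumerations of this kind are one reason a fixed rule set tends to struggle on legal text.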