The importance of clear and correct text in legal documents cannot be overstated. Consequently, a grammatical error correction tool meant to assist legal professionals must be able to recognize errors in the context of legal language and correct them accordingly, which implies that it needs to be trained in the same setting, on realistic legal data. However, the manually annotated data required for such training is in short supply for languages such as Romanian, let alone for a niche domain. The most common workaround is the synthetic generation of parallel data; however, it requires a structured understanding of Romanian grammar. In this paper, we introduce what is, to our knowledge, the first Romanian-language parallel dataset for the detection and correction of grammatical errors in the legal domain, RoLegalGEC, which aggregates 350,000 examples of errors in legal passages, along with error annotations. Moreover, we evaluate several neural network models that turn the dataset into a practical tool for both detecting and correcting grammatical errors, including knowledge-distillation Transformers, sequence tagging architectures for detection, and a variety of pre-trained text-to-text Transformer models for correction. We believe that this set of models, together with the novel RoLegalGEC dataset, will enrich the resource base for further research on Romanian.