This paper presents RISC, an open-source Python package for data generation (https://github.com/GRAAL-Research/risc). RISC generates look-alike automobile insurance contracts, in French and English, based on the Quebec regulatory insurance form. Insurance contracts are 90 to 100 pages long and use legal and insurance-specific vocabulary that is complex for a layperson. Hence, they are a much more complex class of documents than those found in traditional NLP corpora. We therefore introduce RISCBAC, a Realistic Insurance Synthetic Bilingual Automobile Contract dataset based on the mandatory Quebec car insurance contract. The dataset comprises 10,000 French and English unannotated insurance contracts. RISCBAC enables NLP research on unsupervised automatic summarisation, question answering, text simplification, machine translation and more. Moreover, it can be further automatically annotated to serve as a dataset for supervised tasks such as named entity recognition (NER).