Writing a scientific article is challenging because the genre is highly codified and specific; proficiency in written communication is therefore essential for effectively conveying research findings and ideas. In this article, we propose an original textual resource on the revision step of the writing process of scientific articles. This new dataset, called CASIMIR, contains multiple revised versions of 15,646 scientific articles from OpenReview, along with their peer reviews. Pairs of consecutive versions of an article are aligned at the sentence level, and paragraph location information is kept as metadata to support future revision studies at the discourse level. Each pair of revised sentences is enriched with automatically extracted edits and their associated revision intentions. To assess the initial quality of the dataset, we conducted a qualitative study of several state-of-the-art text revision approaches and compared various evaluation metrics. Our experiments led us to question the relevance of current evaluation methods for the text revision task.
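To make the described structure concrete, the following is a minimal sketch of how one aligned sentence pair with paragraph metadata and extracted edits might be represented. The class names, field names, and the `extract_edits` helper are hypothetical and are not the actual CASIMIR schema; the word-level diff from Python's standard `difflib` stands in for whatever edit-extraction tool the dataset really uses, and the revision intention is left unset because the intention labelling step is not reproduced here.

```python
from dataclasses import dataclass, field
from difflib import SequenceMatcher
from typing import List, Optional


@dataclass
class Edit:
    """One edit between two sentence versions (hypothetical layout)."""
    operation: str                   # difflib opcode tag: "replace", "delete" or "insert"
    source_span: str                 # words removed from the earlier version
    target_span: str                 # words added in the later version
    intention: Optional[str] = None  # revision intention label, not predicted in this sketch


@dataclass
class SentencePair:
    """An aligned sentence pair from two consecutive versions (hypothetical layout)."""
    paragraph_index: int             # paragraph location kept as metadata
    source: str
    target: str
    edits: List[Edit] = field(default_factory=list)


def extract_edits(source: str, target: str) -> List[Edit]:
    """Assumed stand-in for edit extraction: a whitespace-token diff."""
    src_tokens = source.split()
    tgt_tokens = target.split()
    matcher = SequenceMatcher(a=src_tokens, b=tgt_tokens, autojunk=False)
    edits = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue  # unchanged spans are not edits
        edits.append(
            Edit(
                operation=tag,
                source_span=" ".join(src_tokens[i1:i2]),
                target_span=" ".join(tgt_tokens[j1:j2]),
            )
        )
    return edits


# Usage: build one record from two versions of a sentence.
old = "To assess the initial quality on the dataset"
new = "To assess the initial quality of the dataset"
pair = SentencePair(paragraph_index=0, source=old, target=new,
                    edits=extract_edits(old, new))
for edit in pair.edits:
    print(edit.operation, repr(edit.source_span), "->", repr(edit.target_span))
```

Keeping the paragraph index on each pair, rather than only the sentence text, is what would allow sentence-level records to be regrouped later for discourse-level analysis.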