Scientific writing is an iterative process that generates rich revision traces, yet publicly available resources typically expose only final or near-final versions of papers. This limits empirical study of revision behaviour and evaluation of large language models (LLMs) for scientific writing. We introduce EarlySciRev, a dataset of early-stage scientific text revisions automatically extracted from arXiv LaTeX source files. Our key observation is that commented-out text in LaTeX often preserves discarded or alternative formulations written by the authors themselves. By aligning commented segments with nearby final text, we extract paragraph-level candidate revision pairs and apply LLM-based filtering to retain genuine revisions. Starting from 1.28M candidate pairs, our pipeline yields 578k validated revision pairs, grounded in authentic early drafting traces. We additionally provide a human-annotated benchmark for revision detection. EarlySciRev complements existing resources focused on late-stage revisions or synthetic rewrites and supports research on scientific writing dynamics, revision modelling, and LLM-assisted editing.
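The extraction idea described above (pairing commented-out LaTeX text with nearby retained text) can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names `split_blocks` and `candidate_pairs`, the similarity thresholds, the neighbourhood window, and the use of `difflib.SequenceMatcher` as the alignment score are all assumptions made for illustration, and the LLM-based filtering step is not shown.

```python
import re
from difflib import SequenceMatcher

# A line counts as commented-out if '%' is its first non-space character.
COMMENT_LINE = re.compile(r"^\s*%+\s?(.*)$")


def split_blocks(latex_source):
    """Group lines into ('comment' | 'text', paragraph) blocks.

    Blank lines and switches between commented and uncommented lines
    both end the current block.
    """
    blocks = []
    kind, buf = None, []

    def flush():
        nonlocal kind, buf
        if buf:
            blocks.append((kind, " ".join(buf)))
        kind, buf = None, []

    for line in latex_source.splitlines():
        if not line.strip():
            flush()
            continue
        m = COMMENT_LINE.match(line)
        line_kind = "comment" if m else "text"
        content = m.group(1).strip() if m else line.strip()
        if line_kind != kind:
            flush()
            kind = line_kind
        if content:
            buf.append(content)
    flush()
    return blocks


def candidate_pairs(latex_source, min_similarity=0.3, max_similarity=0.95, window=1):
    """Pair each commented block with its most similar neighbouring text block.

    The thresholds are hypothetical: too low suggests unrelated text
    (e.g. a TODO note), too high suggests a near-verbatim copy.
    """
    blocks = split_blocks(latex_source)
    pairs = []
    for i, (kind, old) in enumerate(blocks):
        if kind != "comment":
            continue
        best = None
        lo, hi = max(0, i - window), min(len(blocks), i + window + 1)
        for j in range(lo, hi):
            if blocks[j][0] != "text":
                continue
            score = SequenceMatcher(None, old, blocks[j][1]).ratio()
            in_range = min_similarity <= score <= max_similarity
            if in_range and (best is None or score > best[2]):
                best = (old, blocks[j][1], score)
        if best:
            pairs.append(best)
    return pairs


source = r"""
% We propose a new method for detect revisions in papers.
We propose a new method for detecting revisions in scientific papers.

% TODO: add figure
Results are shown in the table.
"""

pairs = candidate_pairs(source)
for old, new, score in pairs:
    print(f"{score:.2f} | {old} -> {new}")
```

In this toy input, the commented-out sentence is paired with its rewritten neighbour, while the TODO note falls below the similarity threshold and is dropped. A surface-similarity score like this can only propose candidates; deciding whether a pair is a genuine revision is the role the abstract assigns to LLM-based filtering and human annotation.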