Writing is a cognitively demanding activity that requires constant decision-making, heavy reliance on working memory, and frequent shifts between tasks with different goals. To build writing assistants that truly align with writers' cognition, we must capture and decode the complete thought process by which writers transform ideas into final texts. We present ScholaWrite, the first dataset of end-to-end scholarly writing, tracing the multi-month journey from initial drafts to final manuscripts. We make three key contributions: (1) a Chrome extension that unobtrusively records keystrokes on Overleaf, enabling the collection of realistic, in-situ writing data; (2) a novel corpus of full scholarly manuscripts, enriched with fine-grained annotations of cognitive writing intentions, which includes \LaTeX-based edits from five computer science preprints and captures nearly 62K text changes over four months; and (3) analyses and insights into the micro-dynamics of scholarly writing, highlighting gaps between human writing processes and the current capabilities of large language models (LLMs) in providing meaningful assistance. ScholaWrite underscores the value of capturing end-to-end writing data to develop future writing assistants that support, not replace, the cognitive work of scientists.