Writing is a cognitively demanding task involving continuous decision-making, heavy use of working memory, and frequent switching between multiple activities. Scholarly writing is particularly complex because it requires authors to coordinate many pieces of multiform knowledge. To understand writers' cognitive processes, one must decode the end-to-end writing data (from individual ideas to the final manuscript) and understand the complex cognitive mechanisms at work in scholarly writing. We introduce the ScholaWrite dataset, a first-of-its-kind keystroke corpus of an end-to-end scholarly writing process for complete manuscripts, with thorough annotations of the cognitive writing intention behind each keystroke. Our dataset includes LaTeX-based keystroke data from five preprints, with nearly 62K total text changes and annotations across four months of paper writing. ScholaWrite shows promising usability and applications (e.g., iterative self-writing), demonstrating the importance of collecting end-to-end writing data, rather than only the final manuscript, for developing future writing assistants that support the cognitive thinking process of scientists. Our de-identified data examples and code are available on our project page.