Writing is a cognitively demanding task involving continuous decision-making, heavy use of working memory, and frequent switching between multiple activities. Scholarly writing is particularly complex because it requires authors to coordinate many pieces of multiform knowledge. To fully understand writers' cognitive processes, one must decode end-to-end writing data (from individual ideas to the final manuscript) and understand the complex cognitive mechanisms at work in scholarly writing. We introduce the ScholaWrite dataset, a first-of-its-kind collection of keystroke logs covering the end-to-end scholarly writing process for complete manuscripts, with thorough annotations of the cognitive writing intention behind each keystroke. The dataset includes LaTeX-based keystroke data from five preprints, with nearly 62K total text changes and annotations collected across four months of paper writing. ScholaWrite shows promising usability and applications (e.g., iterative self-writing) for the future development of AI writing assistants for academic research, which necessitate complex methods beyond LLM prompting. Our experiments demonstrate the importance of collecting end-to-end writing data, rather than only the final manuscript, for developing future writing assistants that support the cognitive thinking process of scientists. Our de-identified dataset, demo, and code repository are available on our project page.