Scholarly writing is a complex process that generally follows a methodical procedure for planning and producing compositions that are both rationally sound and creative. Recent work on large language models (LLMs) demonstrates considerable success in text generation and revision tasks; however, LLMs still struggle to provide the document-level structural and creative feedback that is crucial to academic writing. In this paper, we introduce a novel taxonomy that categorizes scholarly writing behaviors according to intention, writer actions, and the information types of the written data. We also provide ManuScript, an original dataset annotated with a simplified version of our taxonomy to show writer actions and the intentions behind them. Motivated by cognitive writing theory, our taxonomy for scientific papers includes three levels of categorization in order to trace the general writing flow and identify the distinct writer activities embedded within each higher-level process. ManuScript aims to provide a complete picture of the scholarly writing process by capturing both the linearity and non-linearity of the writing trajectory, so that writing assistants can provide stronger feedback and suggestions at an end-to-end level. The collected writing trajectories can be viewed at https://minnesotanlp.github.io/REWARD_demo/