Parse Tree Tracking Through Time for Programming Process Analysis at Scale

Background and Context: Programming process data can be used to understand how students write computer programming assignments. Keystroke- and line-level event logs have been used in various ways, primarily for high-level descriptive statistics (e.g., timings and character deletion rates). Analysis of behavior in context (e.g., how much time students spend working on loops) has been cumbersome because high-level code representations, such as abstract syntax trees, could not be tracked automatically through time and through unparseable states. Objective: Our study has two goals. First, we design the first algorithm that tracks parse tree nodes through time. Second, we use this algorithm to perform a partial replication of prior work that relied on manual tracking of code representations, along with novel analyses of student programming behavior that can now be done at scale. Method: We use two algorithms presented in this paper to track parse tree nodes through time and to construct tree representations for unparseable code states. We apply these algorithms to a public keystroke dataset from student coursework in a 2021 CS1 course and analyze the resulting parse trees. Findings: We report statistics that are newly observable at scale, including that code is deleted at similar rates inside and outside of conditionals and loops, that a third of commented-out code is eventually restored, and that the frequency with which students jump around in their code may not be indicative of struggle. Implications: The ability to track parse trees through time opens the door to understanding new dimensions of student programming, such as best practices for the structural development of code over time, quantitative measurement of which syntactic constructs students struggle with most, refactoring behavior, and attention shifting within the code.
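The kind of in-context question raised above (for example, whether an edit falls inside a loop or a conditional) can be illustrated with a minimal sketch. This is not the tracking algorithm presented in the paper: it parses each snapshot independently with Python's standard `ast` module, does not follow node identity across snapshots, and only flags unparseable states instead of constructing a tree for them. It also assumes the student code is Python. The function name `innermost_construct` and the sample snapshots are hypothetical.

```python
import ast

# Constructs we treat as "context" in this illustration.
CONTEXT_NODES = (ast.For, ast.While, ast.If)


def innermost_construct(source, lineno):
    """Name the innermost loop/conditional enclosing `lineno` in `source`.

    Returns "unparseable" if the snapshot does not parse, and None if the
    line is not inside any loop or conditional.
    """
    try:
        tree = ast.parse(source)
    except SyntaxError:
        # A real tracker would need a tree for this state; this sketch gives up.
        return "unparseable"

    best = None
    for node in ast.walk(tree):
        if isinstance(node, CONTEXT_NODES) and node.lineno <= lineno <= node.end_lineno:
            # Among enclosing constructs, the one starting latest is innermost.
            if best is None or node.lineno >= best.lineno:
                best = node
    return type(best).__name__ if best is not None else None


# Hypothetical snapshots of one file at three points in time.
snapshots = [
    "total = 0\nfor i in range(3):\n    total += i\n",
    "total = 0\nfor i in range(3):\n    if i:\n        total += i\n",
    "total = 0\nfor i in range(3):\n    if i\n        total += i\n",  # missing colon
]

# Hypothetical edit events: (snapshot index, edited line number).
events = [(0, 3), (1, 4), (1, 1), (2, 4)]
contexts = [innermost_construct(snapshots[i], line) for i, line in events]
print(contexts)
```

The last event shows the limitation that motivates the second goal of the study: once a snapshot fails to parse, a per-snapshot approach like this loses all context, which is why tree representations for unparseable states are needed.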
