Predictive code completion greatly accelerates developers' work. In spreadsheets, although they are far more common, such auto-completion features are virtually non-existent. To address this gap, we introduce a benchmark for systems that observe a sequence of user actions in a spreadsheet and predict future actions. Two challenges arise: (1) public spreadsheet corpora contain no edit histories, and (2) the space of spreadsheet actions is complex (spatial, temporal, composite). To address (1), we manually curate 52 sequences of 12K actions that recreate spreadsheets from public corpora, seeded by parametrized heuristics and LLM refinement. To address (2), we propose an online evaluation that requests a prediction after each user action, accepts or rejects that prediction, updates the remaining future actions upon acceptance, and repeats until the target spreadsheet is obtained. We evaluate multiple baseline predictors (including zero-shot LLMs, fine-tuned SLMs, and classical models) and analyze what the benchmark reveals, including the properties of saved actions and false positives, efficiency, and the effects of user profiles, triggers, and context.
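The online evaluation loop described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration, not the benchmark's actual implementation: the `Action` type (a cell address paired with a value), the acceptance rule (a prediction is accepted only if every predicted action is still pending), the statistic names, and the toy `fill_down` predictor.

```python
from typing import Callable, Optional

# Hypothetical, simplified action: (cell address, value). Real spreadsheet
# actions (formatting, formulas, composite edits) are richer than this.
Action = tuple[str, str]
Predictor = Callable[[list[Action]], Optional[list[Action]]]


def try_accept(prediction: list[Action], remaining: list[Action]) -> Optional[list[Action]]:
    """Assumed acceptance rule: every predicted action must still be pending.

    Returns the remaining actions with the predicted ones removed, or None on rejection.
    """
    pending = list(remaining)
    for action in prediction:
        if action not in pending:
            return None
        pending.remove(action)
    return pending


def evaluate_online(actions: list[Action], predictor: Predictor) -> dict[str, int]:
    """Replay a recorded sequence, querying the predictor after each user action."""
    remaining = list(actions)   # future actions still needed to reach the target
    history: list[Action] = []  # actions applied so far, by the user or by accepted predictions
    stats = {"user_actions": 0, "saved_actions": 0, "rejected": 0}

    while remaining:
        # The simulated user performs the next pending action.
        history.append(remaining.pop(0))
        stats["user_actions"] += 1

        prediction = predictor(history)
        if not prediction:
            continue  # the predictor chose not to trigger

        updated = try_accept(prediction, remaining)
        if updated is None:
            stats["rejected"] += 1  # counted here as a false positive
        else:
            remaining = updated     # update the future actions upon acceptance
            history.extend(prediction)
            stats["saved_actions"] += len(prediction)

    return stats


def fill_down(history: list[Action]) -> Optional[list[Action]]:
    """Toy predictor: continue an integer series running down a single-letter column."""
    if len(history) < 2:
        return None
    (cell_a, val_a), (cell_b, val_b) = history[-2:]
    if cell_a[0] != cell_b[0] or not (val_a.isdigit() and val_b.isdigit()):
        return None
    if int(cell_b[1:]) - int(cell_a[1:]) != 1 or int(val_b) - int(val_a) != 1:
        return None
    return [(f"{cell_b[0]}{int(cell_b[1:]) + 1}", str(int(val_b) + 1))]
```

For a sequence that fills `A1` to `A4` with 1 to 4 and then types into `B1`, the toy predictor has one prediction accepted (`A3`) and one rejected (`A5`, which the target spreadsheet never contains). Tracking rejections alongside saved actions reflects the abstract's pairing of saved actions with false positives.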