Spreadsheets are a vital tool for end-user data management. Using large language models for formula authoring assistance in these environments can be difficult, as these models are expensive to train and challenging to deploy due to their size (up to billions of parameters). We present FLAME, a transformer-based model trained exclusively on Excel formulas that leverages domain insights to achieve competitive performance while being substantially smaller (60M parameters) and training on two orders of magnitude less data. We curate a training dataset using sketch deduplication, introduce an Excel-specific formula tokenizer, and use domain-specific versions of masked span prediction and noisy auto-encoding as pre-training objectives. We evaluate FLAME on formula repair, formula completion, and similarity-based formula retrieval. FLAME can outperform much larger models, such as the Davinci (175B) and Cushman (12B) variants of Codex and CodeT5 (220M), in 10 of 14 evaluation settings for the repair and completion tasks. For formula retrieval, FLAME outperforms CodeT5, CodeBERT, and GraphCodeBERT.
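As a rough illustration of the sketch deduplication step mentioned above, the following minimal Python sketch collapses formulas that differ only in their literals and keeps one representative per group. The abstract does not define what a formula sketch is, so the choice to abstract away string constants, cell references, and numeric constants is an assumption, as are the regular expressions and the names `sketch` and `dedupe_by_sketch`. It is not the authors' implementation.

```python
import re

# Hypothetical token patterns: the abstract does not specify the sketch,
# so these regexes are simplifying assumptions for illustration only.
_STRING = re.compile(r'"(?:[^"]|"")*"')  # Excel string literal, "" escapes a quote
_CELL = re.compile(r"\$?[A-Za-z]{1,3}\$?\d+(?::\$?[A-Za-z]{1,3}\$?\d+)?")  # A1, $B$2, A1:C10
_NUMBER = re.compile(r"(?<![A-Za-z_\d.])\d+(?:\.\d+)?")  # 5, 3.14


def sketch(formula: str) -> str:
    """Replace literals with placeholders so structurally equal formulas match."""
    s = _STRING.sub("<STR>", formula)  # strings first, so their contents are not re-scanned
    s = _CELL.sub("<REF>", s)
    s = _NUMBER.sub("<NUM>", s)
    # Drop whitespace and normalize case (Excel function names are case-insensitive).
    return re.sub(r"\s+", "", s).upper()


def dedupe_by_sketch(formulas):
    """Keep the first formula seen for each distinct sketch, preserving order."""
    seen = set()
    kept = []
    for formula in formulas:
        key = sketch(formula)
        if key not in seen:
            seen.add(key)
            kept.append(formula)
    return kept


if __name__ == "__main__":
    corpus = ['=SUM(A1:A10)+5', '=sum(B2:B20) + 7', '=IF(A1>0,"yes","no")']
    for formula in dedupe_by_sketch(corpus):
        print(formula, "->", sketch(formula))
```

Under these assumptions, `=SUM(A1:A10)+5` and `=sum(B2:B20) + 7` share one sketch, so only the first is kept. The regex approach is deliberately crude: for instance, it would treat a function name such as `LOG10` as a cell reference, so a real pipeline would more plausibly work from a proper formula tokenizer.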