We introduce Syntax-Aware Fill-In-the-Middle (SAFIM), a new benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. This benchmark focuses on syntax-aware completions of program structures such as code blocks and conditional expressions, and includes 17,720 examples from multiple programming languages, sourced from recent code submissions after April 2022 to minimize data contamination. SAFIM provides a robust framework with various prompt designs and novel syntax-aware post-processing techniques, facilitating accurate and fair comparisons across LLMs. Our comprehensive evaluation of 15 LLMs shows that FIM pretraining not only enhances FIM proficiency but also improves LLMs' Left-to-Right (L2R) inference. Our findings challenge conventional beliefs and suggest that pretraining methods and data quality have more impact than model size. SAFIM thus serves as a foundational platform for future research in effective pretraining strategies for code LLMs. The evaluation toolkit and dataset are available at https://github.com/gonglinyuan/safim, and the leaderboard is available at https://safimbenchmark.com.