We introduce Syntax-Aware Fill-In-the-Middle (SAFIM), a new benchmark for evaluating Large Language Models (LLMs) on the code Fill-in-the-Middle (FIM) task. The benchmark focuses on syntax-aware completion of program structures such as code blocks and conditional expressions, and includes 17,720 examples from multiple programming languages, sourced from code submissions made after April 2022 to minimize data contamination. SAFIM provides a robust evaluation framework with various prompt designs and novel syntax-aware post-processing techniques, facilitating accurate and fair comparisons across LLMs. Our comprehensive evaluation of 15 LLMs shows that FIM pretraining not only enhances FIM proficiency but also improves Left-to-Right (L2R) inference. Our findings challenge conventional beliefs and suggest that pretraining methods and data quality have more impact than model size. SAFIM thus serves as a foundational platform for future research in effective pretraining strategies for code LLMs. The evaluation toolkit and dataset are available at https://github.com/gonglinyuan/safim, and the leaderboard is available at https://safimbenchmark.com.
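The FIM task asks a model to generate a missing middle span given the surrounding prefix and suffix. As a minimal sketch of the idea, the following shows how such a prompt might be assembled in the widely used prefix-suffix-middle (PSM) sentinel layout; the sentinel token names here are illustrative placeholders, not SAFIM's exact prompt designs:

```python
def build_psm_prompt(prefix: str, suffix: str) -> str:
    """Assemble a Fill-in-the-Middle prompt in prefix-suffix-middle (PSM)
    order using illustrative sentinel tokens. The model is expected to
    generate the missing middle span after the final sentinel.

    Note: real code LLMs use model-specific sentinel tokens; <PRE>, <SUF>,
    and <MID> here are hypothetical stand-ins for illustration only.
    """
    return f"<PRE>{prefix}<SUF>{suffix}<MID>"

# Example: the masked span is the body of an if-statement, i.e. a
# "code block" completion target of the kind SAFIM evaluates.
prefix = "def absolute(x):\n    if x < 0:\n"
suffix = "\n    return x"
print(build_psm_prompt(prefix, suffix))
```

A model with FIM pretraining would then be expected to produce a syntactically complete block such as `        return -x`, after which syntax-aware post-processing can truncate any overgenerated text at the block boundary.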