Pre-trained large language models have significantly improved code generation. As these models scale up, there is an increasing need for their output to handle more intricate tasks and to be appropriately specialized to particular domains. Here, we target bioinformatics because of the amount of specialized domain knowledge, algorithms, and data operations this discipline requires. We present BioCoder, a benchmark developed to evaluate large language models (LLMs) in generating bioinformatics-specific code. BioCoder spans a broad spectrum of the field and covers cross-file dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods extracted from GitHub, along with 253 examples from the Rosalind Project, all pertaining to bioinformatics. Using topic modeling, we show that the overall coverage of the included code is representative of the full spectrum of bioinformatics calculations. BioCoder incorporates a fuzz-testing framework for evaluation. We have applied it to evaluate many models, including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. Furthermore, we fine-tuned StarCoder, demonstrating how our dataset can effectively enhance the performance of LLMs on our benchmark (by >15% in terms of Pass@K in certain prompt configurations and always >3%). The results highlight two key aspects of successful models: (1) they accommodate a long prompt (more than ~2600 tokens) with full context for functional dependencies, and (2) they contain specific domain knowledge of bioinformatics, beyond general coding knowledge. This is evident from the performance gain of GPT-3.5/4 over the smaller models on the benchmark (50% vs. up to ~25%). Our dataset, benchmark, Docker images, and scripts required for testing are all available at https://github.com/gersteinlab/biocoder.
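The abstract reports results in terms of Pass@K but does not state how the metric is computed. As a rough illustration only, the sketch below uses the unbiased estimator that is commonly applied in code-generation evaluation: for a problem with n sampled completions, of which c pass all tests, pass@k = 1 - C(n-c, k) / C(n, k). The function names `pass_at_k` and `mean_pass_at_k` are hypothetical and are not taken from the BioCoder repository; whether BioCoder uses exactly this estimator is an assumption.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k for one problem (hypothetical helper, not BioCoder's code).

    n: number of completions sampled for the problem
    c: number of those completions that passed every test
    k: sample budget being evaluated
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failing samples: every size-k subset contains a pass.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset holds only failing samples,
    # subtracted from 1.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Average pass@k over problems; each entry is (n, c) for one problem."""
    if not results:
        raise ValueError("results must not be empty")
    return sum(pass_at_k(n, c, k) for n, c in results) / len(results)
```

Under this assumption, a benchmark-level score would be obtained by calling `mean_pass_at_k` with one `(n, c)` pair per benchmark problem, where c comes from running each completion through the test harness.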