Pre-trained large language models have significantly improved code generation. As these models scale up, their output is increasingly expected to handle more intricate tasks and to be specialized to particular domains. Bioinformatics is one such domain: generating functional programs here poses additional challenges because of the amount of specialized domain knowledge required, the need for complicated data operations, and the intricate functional dependencies between those operations. Here, we present BioCoder, a benchmark developed to evaluate existing pre-trained models in generating bioinformatics code. For function-level code generation, BioCoder covers potential package dependencies, class declarations, and global variables. It incorporates 1,026 Python functions and 1,243 Java methods from GitHub, along with 253 examples from the Rosalind Project. BioCoder also incorporates a fuzz-testing framework for evaluation, which we have applied to many models, including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, GPT-3.5, and GPT-4. The results highlight two key aspects of successful models: (1) they contain specific domain knowledge of bioinformatics (beyond general coding knowledge); and (2) they accommodate a long prompt with full context (i.e., functional dependencies). Our dataset, benchmark, Docker images, and the scripts required for testing are all available at https://github.com/gersteinlab/biocoder.
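To make the idea of fuzz-testing-based evaluation concrete, the following is a minimal sketch and not BioCoder's actual framework. It assumes a Rosalind-style task (reverse complement of a DNA string) and compares a candidate function against a reference solution on randomly generated inputs. All names here (`reverse_complement`, `candidate_ok`, `candidate_buggy`, `random_dna`, `fuzz_test`) are hypothetical and chosen only for illustration.

```python
import random


def reverse_complement(dna: str) -> str:
    """Hypothetical reference ("golden") solution for the illustrative task."""
    pairs = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(pairs[base] for base in reversed(dna))


def candidate_ok(dna: str) -> str:
    """Stand-in for a correct model-generated function."""
    return dna[::-1].translate(str.maketrans("ACGT", "TGCA"))


def candidate_buggy(dna: str) -> str:
    """Stand-in for a flawed model output: complements but never reverses."""
    return dna.translate(str.maketrans("ACGT", "TGCA"))


def random_dna(rng: random.Random, max_len: int = 50) -> str:
    """Draw a random DNA string of length 0..max_len."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice("ACGT") for _ in range(length))


def fuzz_test(candidate, reference, trials: int = 200, seed: int = 0) -> bool:
    """Return True only if the candidate matches the reference on every input."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(trials):
        dna = random_dna(rng)
        try:
            got = candidate(dna)
        except Exception:
            return False  # a crash on any input counts as a failure
        if got != reference(dna):
            return False
    return True


if __name__ == "__main__":
    print("correct candidate passes:", fuzz_test(candidate_ok, reverse_complement))
    print("buggy candidate passes:", fuzz_test(candidate_buggy, reverse_complement))
```

In this sketch a candidate passes only if it agrees with the reference on every randomly drawn input, which is one simple way to check functional correctness without hand-written test cases. The input generator, trial count, and pass criterion are assumptions made for this example.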