Pre-trained language models such as ChatGPT have significantly improved code generation. As these models scale up, they are increasingly expected to handle more intricate tasks. In bioinformatics, generating functional programs poses additional challenges: it demands substantial domain knowledge, complicated data operations, and intricate functional dependencies between those operations. Here, we present BioCoder, a benchmark developed to evaluate existing pre-trained models on generating bioinformatics code. For function-level code generation, BioCoder covers potential package dependencies, class declarations, and global variables. It incorporates 1026 Python functions and 1243 Java methods from GitHub, along with 253 examples from the Rosalind Project. BioCoder includes a fuzz-testing framework for evaluation, and we have applied it to evaluate many models, including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, and ChatGPT. Our detailed analysis of these models emphasizes the importance of domain knowledge, pragmatic code generation, and contextual understanding. Our dataset, benchmark, Docker images, and the scripts required for testing are all available at https://github.com/gersteinlab/biocoder.
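To illustrate what fuzz-testing generated code can look like in general, the following is a minimal sketch and not BioCoder's actual framework. It assumes a trusted reference implementation is available and compares a candidate function against it on randomly generated DNA strings. All names here (`reverse_complement_reference`, `reverse_complement_candidate`, `random_dna`, `fuzz_test`) are hypothetical and chosen only for this example.

```python
import random


def reverse_complement_reference(seq: str) -> str:
    """Trusted reference: reverse complement of a DNA string."""
    complement = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(complement[base] for base in reversed(seq))


def reverse_complement_candidate(seq: str) -> str:
    """Stand-in for model-generated code under test (hypothetical)."""
    return seq[::-1].translate(str.maketrans("ACGT", "TGCA"))


def random_dna(rng: random.Random, max_len: int = 50) -> str:
    """Generate a random DNA string of length 0..max_len."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice("ACGT") for _ in range(length))


def fuzz_test(candidate, reference, trials: int = 1000, seed: int = 0) -> bool:
    """Return True if candidate matches reference on every random input."""
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(trials):
        seq = random_dna(rng)
        try:
            if candidate(seq) != reference(seq):
                return False
        except Exception:
            # Any crash in the candidate counts as a failure.
            return False
    return True


if __name__ == "__main__":
    print(fuzz_test(reverse_complement_candidate, reverse_complement_reference))
```

Compared with a handful of fixed test cases, random inputs make it harder for a candidate to pass by handling only the examples it was shown. The cost is that a reference implementation and an input generator are needed for each task.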