Pre-trained language models such as ChatGPT have significantly improved code generation. As these models scale up, their output is increasingly expected to handle more intricate tasks. In bioinformatics, generating functional programs poses additional challenges because of the amount of domain knowledge required, the need for complicated data operations, and the intricate functional dependencies between those operations. Here, we present BioCoder, a benchmark developed to evaluate existing pre-trained models on bioinformatics code generation. For function-level code generation, BioCoder covers potential package dependencies, class declarations, and global variables. It incorporates 1026 functions and 1243 methods in Python and Java from GitHub, along with 253 examples from the Rosalind Project. BioCoder also includes a fuzz-testing framework for evaluation, which we have applied to many models, including InCoder, CodeGen, CodeGen2, SantaCoder, StarCoder, StarCoder+, InstructCodeT5+, and ChatGPT. Our detailed analysis of these models emphasizes the importance of domain knowledge, pragmatic code generation, and contextual understanding. Our dataset, benchmark, Docker images, and the scripts required for testing are all available at https://github.com/gersteinlab/biocoder.
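To make the idea of fuzz-testing generated code concrete, the following is a minimal sketch of the general technique, not BioCoder's actual framework. It assumes a trusted reference implementation is available as an oracle and compares a candidate function against it on randomly generated DNA strings. All names here (`reference_reverse_complement`, `buggy_reverse_complement`, `random_dna`, `fuzz_test`) are hypothetical and chosen only for illustration.

```python
import random


def reference_reverse_complement(seq: str) -> str:
    """Trusted ("golden") implementation used as the oracle (hypothetical)."""
    complement = {"A": "T", "T": "A", "C": "G", "G": "C"}
    return "".join(complement[base] for base in reversed(seq))


def buggy_reverse_complement(seq: str) -> str:
    """Stand-in for model output: complements each base but forgets to reverse."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))


def random_dna(rng: random.Random, max_len: int = 50) -> str:
    """Generate a random DNA string of length 0..max_len."""
    length = rng.randint(0, max_len)
    return "".join(rng.choice("ACGT") for _ in range(length))


def fuzz_test(candidate, reference, trials: int = 200, seed: int = 0):
    """Compare candidate and reference on random inputs.

    Returns (True, None) if every trial agrees, otherwise
    (False, first_failing_input). An exception raised by the
    candidate counts as a failure.
    """
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    for _ in range(trials):
        seq = random_dna(rng)
        try:
            got = candidate(seq)
        except Exception:
            return False, seq
        if got != reference(seq):
            return False, seq
    return True, None


if __name__ == "__main__":
    print(fuzz_test(reference_reverse_complement, reference_reverse_complement))
    passed, counterexample = fuzz_test(
        buggy_reverse_complement, reference_reverse_complement
    )
    print(passed, repr(counterexample))
```

Comparing against a reference on random inputs avoids hand-writing expected outputs for each test case, at the cost of requiring a reference implementation that is known to be correct.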