As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets evaluate only a limited set of application domains. To address this gap, we develop FullStack Bench, a comprehensive code evaluation dataset focused on full-stack programming that spans a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). In addition, to assess multilingual programming capabilities, FullStack Bench includes real-world instructions and corresponding unit test cases in 16 widely used programming languages, reflecting real-world usage scenarios rather than simple translations. Moreover, we release SandboxFusion, an effective code sandbox execution tool that supports various programming languages and packages, enabling efficient evaluation on FullStack Bench. Comprehensive experimental results demonstrate the necessity and effectiveness of FullStack Bench and SandboxFusion.