As the capabilities of code large language models (LLMs) continue to expand, their applications across diverse code intelligence domains are rapidly increasing. However, most existing datasets evaluate only a limited range of application domains. To address this gap, we have developed FullStack Bench, a comprehensive code evaluation dataset focused on full-stack programming that covers a wide range of application domains (e.g., basic programming, data analysis, software engineering, mathematics, and machine learning). In addition, to assess multilingual programming capabilities, FullStack Bench includes real-world instructions and corresponding unit test cases in 16 widely used programming languages, designed to reflect real-world usage scenarios rather than simple translations. We also release SandboxFusion, an effective code sandbox execution tool that supports a variety of programming languages and packages, enabling efficient evaluation on FullStack Bench. Comprehensive experimental results demonstrate the necessity and effectiveness of both FullStack Bench and SandboxFusion.