Code intelligence plays a key role in transforming modern software engineering. Recently, deep learning-based models, especially Transformer-based large language models (LLMs), have demonstrated remarkable potential in tackling code intelligence tasks by leveraging massive open-source code data and programming language features. However, developing and deploying such models often requires expertise in both machine learning and software engineering, which creates a barrier to adoption. In this paper, we present CodeTF, an open-source Transformer-based library for state-of-the-art Code LLMs and code intelligence. Following the principles of modular design and an extensible framework, we design CodeTF with a unified interface that enables rapid access and development across different types of models, datasets, and tasks. The library supports a collection of pretrained Code LLMs and popular code benchmarks, and provides a standardized interface for training and serving Code LLMs efficiently, along with data features such as language-specific parsers and utility functions for extracting code attributes. We describe the design principles, architecture, and key modules and components, and compare CodeTF with related libraries. We hope CodeTF can bridge the gap between machine learning/generative AI and software engineering, providing a comprehensive open-source solution for developers, researchers, and practitioners.