COMEX: A Tool for Generating Customized Source Code Representations

Learning effective representations of source code is critical for any Machine Learning for Software Engineering (ML4SE) system. Inspired by natural language processing, large language models (LLMs) like Codex and CodeGen treat code as generic sequences of text and are trained on huge corpora of code, achieving state-of-the-art performance on several software engineering (SE) tasks. However, valid source code, unlike natural language, follows a strict structure and pattern governed by the underlying grammar of the programming language. Current LLMs do not exploit this property: because they treat code as a sequence of tokens, they overlook key structural and semantic properties that can be extracted from code-views such as the Control Flow Graph (CFG), Data Flow Graph (DFG), and Abstract Syntax Tree (AST). Unfortunately, generating and integrating code-views for every programming language is cumbersome and time-consuming. To overcome this barrier, we propose COMEX, a framework that allows researchers and developers to create and combine multiple code-views which can be used by machine learning (ML) models for various SE tasks. Some salient features of our tool are: (i) it works directly on source code (which need not be compilable), (ii) it currently supports Java and C#, (iii) it can analyze both method-level and program-level snippets by using both intra-procedural and inter-procedural analysis, and (iv) it is easily extensible to other languages, as it is built on tree-sitter, a widely used incremental parser that supports over 40 languages. We believe this easy-to-use code-view generation and customization tool will give impetus to research in source code representation learning methods and ML4SE. Tool: https://pypi.org/project/comex - GitHub: https://github.com/IBM/tree-sitter-codeviews - Demo: https://youtu.be/GER6U87FVbU
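To make the idea of combining code-views concrete, the following is a minimal sketch and not COMEX itself. It assumes Python input and uses Python's standard `ast` module as a stand-in for tree-sitter, whereas COMEX targets Java and C#. The function names `ast_edges`, `dataflow_edges`, and `combined_view` are hypothetical and chosen only for illustration, and the data-flow view handles straight-line, top-level statements only.

```python
import ast


def ast_edges(tree):
    """AST view: (parent_type, child_type) pairs for every parent-child link."""
    edges = []
    for parent in ast.walk(tree):
        for child in ast.iter_child_nodes(parent):
            edges.append((type(parent).__name__, type(child).__name__))
    return edges


def dataflow_edges(tree):
    """Simplified data-flow view: (name, def_line, use_line) triples.

    Assumption: straight-line, top-level statements only. Each use of a
    variable is linked to the most recent earlier assignment of that name;
    branches, loops, and function bodies are not modelled.
    """
    last_def = {}  # variable name -> line of its latest assignment
    edges = []
    for stmt in tree.body:
        names = [n for n in ast.walk(stmt) if isinstance(n, ast.Name)]
        # Reads happen before the statement's own writes take effect.
        for n in names:
            if isinstance(n.ctx, ast.Load) and n.id in last_def:
                edges.append((n.id, last_def[n.id], stmt.lineno))
        for n in names:
            if isinstance(n.ctx, ast.Store):
                last_def[n.id] = stmt.lineno
    return sorted(edges)


# Registry of available views; the caller selects which ones to combine.
VIEWS = {"ast": ast_edges, "dfg": dataflow_edges}


def combined_view(source, views=("ast", "dfg")):
    """Parse once, then build only the requested views over the same tree."""
    tree = ast.parse(source)
    return {name: VIEWS[name](tree) for name in views}


if __name__ == "__main__":
    snippet = "a = 1\nb = a + 2\nc = a * b\n"
    for name, edges in combined_view(snippet).items():
        print(name, edges)
```

Parsing once and deriving every requested view from the same tree mirrors the motivation stated above: a downstream model can consume syntactic and data-flow edges together without a separate pipeline per view.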
