Code classification is a difficult problem in program understanding and automatic coding. Because programs have elusive syntax and complicated semantics, most existing studies build code representations for classification using techniques based on the abstract syntax tree (AST) and graph neural networks (GNNs). These techniques exploit the structural and semantic information of the code, but they consider only pairwise associations and neglect the high-order correlations that exist between nodes in the AST, which may cause code structural information to be lost. A general hypergraph, on the other hand, can encode high-order data correlations, but it is homogeneous and undirected, so modeling an AST with it discards semantic and structural information such as node types, edge types, and the direction between child nodes and parent nodes. In this study, we propose to represent the AST as a heterogeneous directed hypergraph (HDHG) and to process it with a heterogeneous directed hypergraph neural network (HDHGN) for code classification. Our method improves code understanding and can represent high-order data correlations beyond pairwise interactions. We evaluate HDHGN on public datasets of Python and Java programs, where it outperforms previous AST-based and GNN-based methods, demonstrating the capability of our model.
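To make the idea of a heterogeneous directed hypergraph over an AST more concrete, the following is a minimal sketch in Python using the standard `ast` module. It is an illustration under stated assumptions, not the construction used for HDHG: here each AST field of a parent node (for example `body` or `args`) is assumed to become one typed hyperedge that joins the parent to all children under that field, the node type is taken from the AST class name, and the direction is recorded simply by keeping `parent` and `children` separate. The names `HyperEdge`, `ast_to_hypergraph`, and `SAMPLE` are hypothetical.

```python
import ast
from dataclasses import dataclass


@dataclass(frozen=True)
class HyperEdge:
    """One typed, directed hyperedge (hypothetical structure for illustration)."""
    edge_type: str    # assumed edge type: the AST field name, e.g. "body"
    parent: int       # index of the parent node
    children: tuple   # indices of all child nodes grouped under this field


def ast_to_hypergraph(source):
    """Return (node_types, edges) for the AST of `source`."""
    tree = ast.parse(source)
    node_ids = {}     # id(node) -> node index
    node_types = []   # node index -> AST class name (the assumed node type)

    # First pass: assign an index and a type to every distinct AST node.
    for node in ast.walk(tree):
        if id(node) not in node_ids:  # some context nodes may be shared
            node_ids[id(node)] = len(node_types)
            node_types.append(type(node).__name__)

    # Second pass: one hyperedge per non-empty field of each node.
    edges = []
    for node in ast.walk(tree):
        for field, value in ast.iter_fields(node):
            candidates = value if isinstance(value, list) else [value]
            child_ids = tuple(
                node_ids[id(child)]
                for child in candidates
                if isinstance(child, ast.AST)
            )
            if child_ids:
                edges.append(HyperEdge(field, node_ids[id(node)], child_ids))
    return node_types, edges


SAMPLE = "def f(a, b):\n    c = a + b\n    return c\n"
node_types, edges = ast_to_hypergraph(SAMPLE)

for edge in edges:
    children = [node_types[i] for i in edge.children]
    print(f"{node_types[edge.parent]} --{edge.edge_type}--> {children}")
```

In this sketch the `body` hyperedge of the function definition connects one parent to two statements at once, which is the kind of beyond-pairwise relation that an ordinary graph edge cannot express directly. A real model would additionally need learned embeddings for node and edge types and a message-passing scheme over these hyperedges.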