Categorizing source code accurately and efficiently is a challenging problem in the management of real-world programming education platforms. In recent years, model-based approaches that use abstract syntax trees (ASTs) have been widely applied to code classification tasks. In this paper, we introduce the Sparse Attention-based neural network for Code Classification (SACC). The approach has two main steps. In the first step, the source code is parsed and preprocessed; the resulting AST is split into sequences of subtrees, which are then encoded with a recursive neural network to obtain a high-dimensional representation. This step captures both the logical structure and the lexical-level information in the code. In the second step, the encoded subtree sequences are fed into a Transformer model that incorporates sparse attention mechanisms for classification. Sparse attention reduces the computational cost of self-attention, which improves training speed while preserving effectiveness. We design a sparse attention pattern tailored to the needs of code classification tasks; it reduces the influence of redundant information and improves the overall performance of the model. Finally, we address two limitations of previous related research: incomplete classification labels and small dataset size. To this end, we annotated the CodeNet dataset, which contains a large amount of data, with algorithm-related category labels. Extensive comparative experiments demonstrate the effectiveness and efficiency of SACC on code classification tasks.
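To make the cost argument for sparse attention concrete, the following is a minimal sketch of scaled dot-product attention restricted by a mask. The abstract does not specify SACC's actual sparse pattern, so the sliding-window pattern used here is only an assumption chosen for illustration, and the names `sliding_window_mask`, `sparse_attention`, and `count_scored_pairs` are hypothetical. The input vectors stand in for the encoded subtrees.

```python
import math
from typing import List

Vector = List[float]


def sliding_window_mask(seq_len: int, window: int) -> List[List[bool]]:
    """Return mask[i][j] == True when position i may attend to position j.

    Assumption: a local window of radius `window` (>= 0). This is NOT the
    pattern from the paper, only a common example of a sparse pattern.
    """
    return [[abs(i - j) <= window for j in range(seq_len)] for i in range(seq_len)]


def sparse_attention(queries: List[Vector], keys: List[Vector],
                     values: List[Vector], mask: List[List[bool]]) -> List[Vector]:
    """Scaled dot-product attention that scores only the pairs allowed by mask."""
    dim = len(queries[0])
    outputs = []
    for i, query in enumerate(queries):
        # Only allowed positions are scored; the rest are skipped entirely.
        allowed = [j for j, ok in enumerate(mask[i]) if ok]
        scores = [
            sum(a * b for a, b in zip(query, keys[j])) / math.sqrt(dim)
            for j in allowed
        ]
        # Softmax over the allowed positions, shifted by the max for stability.
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        out = [0.0] * len(values[0])
        for weight, j in zip(weights, allowed):
            for d, v in enumerate(values[j]):
                out[d] += (weight / total) * v
        outputs.append(out)
    return outputs


def count_scored_pairs(mask: List[List[bool]]) -> int:
    """Number of query-key pairs that are actually scored under the mask."""
    return sum(sum(row) for row in mask)
```

With a sequence of 8 encoded subtrees and a window radius of 1, `count_scored_pairs(sliding_window_mask(8, 1))` gives 22 scored pairs, compared with 64 for full self-attention. For a fixed window the number of scored pairs grows linearly with sequence length rather than quadratically, which is the kind of saving the abstract refers to.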