Representing source code with deep learning techniques is an important research field. Many studies learn sequential or structural information for code representation, but sequence-based and non-sequence models both have limitations. Researchers have attempted to incorporate structural information into sequence-based models, but they mine only part of the token-level hierarchical structure information. In this paper, we analyze how the complete hierarchical structure influences the tokens in code sequences and abstract this influence as a property of code tokens called hierarchical embedding. The hierarchical embedding is further divided into a statement-level global hierarchy and a token-level local hierarchy. Furthermore, we propose the Hierarchy Transformer (HiT), a simple but effective sequence model that incorporates the complete hierarchical embeddings of source code into a Transformer model. We demonstrate the effectiveness of hierarchical embedding for learning code structure with an experiment on a variable scope detection task. Further evaluation shows that HiT outperforms state-of-the-art (SOTA) baseline models and shows stable training efficiency on three source code-related tasks, covering both classification and generation, across 8 different datasets.
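To make the distinction between the statement-level global hierarchy and the token-level local hierarchy concrete, the sketch below shows one plausible reading of the idea. It is an illustration only and not the construction used by HiT, which the abstract does not specify. It assumes Python source parsed with the standard `ast` module; the function name `hierarchy_paths`, the choice to track only identifier tokens, and the use of node-type names as path elements are all hypothetical.

```python
import ast

SAMPLE = """
def f(xs):
    for x in xs:
        y = x + 1
    return y
"""

def hierarchy_paths(source):
    """Return (identifier, global_path, local_path) for each Name node.

    global_path: chain of enclosing statement types (statement level).
    local_path:  chain of expression types inside the innermost
                 statement, ending at the identifier (token level).
    Both definitions are assumptions made for this illustration.
    """
    rows = []

    def visit(node, global_path, local_path):
        for child in ast.iter_child_nodes(node):
            kind = type(child).__name__
            if isinstance(child, ast.stmt):
                # Entering a statement extends the global path and
                # resets the local path.
                visit(child, global_path + [kind], [])
            elif isinstance(child, ast.expr):
                path = local_path + [kind]
                if isinstance(child, ast.Name):
                    rows.append((child.id, tuple(global_path), tuple(path)))
                visit(child, global_path, path)
            else:
                # Helper nodes (arguments, operators, contexts) do not
                # contribute to either path in this sketch.
                visit(child, global_path, local_path)

    visit(ast.parse(source), [], [])
    return rows

rows = hierarchy_paths(SAMPLE)
for name, global_path, local_path in rows:
    print(name, "/".join(global_path), "/".join(local_path))
```

In this sample, the two occurrences of `y` share the same local path but differ in their global path (`FunctionDef/For/Assign` versus `FunctionDef/Return`). A purely token-level view would miss that kind of distinction, which is relevant to a task such as variable scope detection.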