Several approaches exist for encoding source code as input vectors for neural models. These approaches attempt to capture various syntactic and semantic features of input programs in their encoding. In this paper, we investigate Code2Snapshot, a novel representation of source code that is based on snapshots of input programs. We evaluate several variations of this representation and compare its performance with state-of-the-art representations that utilize the rich syntactic and semantic features of input programs. Our preliminary study on the utility of Code2Snapshot in the code summarization and code classification tasks suggests that simple snapshots of input programs perform comparably to state-of-the-art representations. Interestingly, obscuring the input programs has an insignificant impact on the performance of Code2Snapshot, suggesting that, for some tasks, neural models may achieve high performance by relying merely on the structure of input programs.
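To make the idea of an obscured, structure-only input concrete, the following is a minimal sketch and not the pipeline used in the paper. It assumes that obscuring means replacing every alphanumeric character with `x`, and it uses a binary character-occupancy grid as a simple stand-in for a rendered image snapshot. The function names `redact` and `to_grid` and the grid dimensions are hypothetical, chosen only for illustration.

```python
from typing import List


def redact(source: str) -> str:
    """Replace every alphanumeric character with 'x' (assumed obscuring rule).

    Whitespace, punctuation, and line breaks are kept, so the visual layout
    of the program is unchanged while identifiers and keywords are hidden.
    """
    return "".join("x" if ch.isalnum() else ch for ch in source)


def to_grid(source: str, width: int, height: int) -> List[List[int]]:
    """Map source text onto a fixed-size grid (a stand-in for an image).

    A cell is 1 where a visible character sits and 0 where there is
    whitespace or padding. Text beyond the grid bounds is cropped.
    """
    grid = [[0] * width for _ in range(height)]
    lines = source.expandtabs(4).splitlines()[:height]
    for row, line in enumerate(lines):
        for col, ch in enumerate(line[:width]):
            if not ch.isspace():
                grid[row][col] = 1
    return grid


program = "int add(int a, int b) {\n    return a + b;\n}\n"
obscured = redact(program)

# The obscured program carries no identifiers, yet its layout grid
# is the same as that of the original program.
same_structure = to_grid(program, 32, 4) == to_grid(obscured, 32, 4)
```

In this sketch the original and obscured programs map to the same grid, which illustrates why a model that reads only layout would be unaffected by obscuring. An actual snapshot would be a rendered image with glyph shapes, so this equivalence holds only for the simplified grid used here.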