Topical: Learning Repository Embeddings from Source Code using Attention

Machine learning on source code (MLOnCode) promises to transform how software is delivered. By mining the context of, and relationships between, software artefacts, MLOnCode augments software developers' capabilities with code auto-generation, code recommendation, code auto-tagging and other data-driven enhancements. For many of these tasks a script-level representation of code is sufficient. In many other cases, however, a repository-level representation that takes into account the various dependencies and the repository structure is imperative, for example when auto-tagging repositories with topics or auto-documenting repository code. Existing methods for computing repository-level representations suffer from (a) reliance on natural language documentation of code (for example, README files) and (b) naive aggregation of method- or script-level representations, for example by concatenation or averaging. This paper introduces Topical, a deep neural network that generates repository-level embeddings of publicly available GitHub code repositories directly from source code. Topical incorporates an attention mechanism that projects the source code, the full dependency graph and the script-level textual information into a dense repository-level representation. To compute the repository-level representations, Topical is trained to predict the topics associated with a repository, on a dataset of publicly available GitHub repositories that were crawled along with their ground-truth topic tags. Our experiments show that, on the task of repository auto-tagging, the embeddings computed by Topical outperform multiple baselines, including baselines that naively combine method-level representations through averaging or concatenation.
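
To make the contrast between naive averaging and attention-based aggregation concrete, the following is a minimal sketch in plain Python. It is not Topical's architecture, which the abstract does not specify in detail: the function names `mean_pool` and `attention_pool`, the single query vector, and the scaled dot-product scoring are all assumptions chosen for illustration. In a trained model the query would be learned (here, from the topic-prediction objective) rather than supplied by hand.

```python
import math


def mean_pool(script_embeddings):
    """Naive baseline: average the script-level vectors, one weight per script."""
    n = len(script_embeddings)
    dim = len(script_embeddings[0])
    return [sum(vec[i] for vec in script_embeddings) / n for i in range(dim)]


def attention_pool(script_embeddings, query):
    """Hypothetical attention pooling over script-level vectors.

    Each script is scored by a scaled dot product with `query` (an assumed,
    normally learned, vector); a softmax turns the scores into weights, and
    the repository vector is the weighted sum of the script vectors.
    """
    dim = len(query)
    scores = [
        sum(q * x for q, x in zip(query, vec)) / math.sqrt(dim)
        for vec in script_embeddings
    ]
    top = max(scores)  # subtract the maximum for a numerically stable softmax
    exp_scores = [math.exp(s - top) for s in scores]
    total = sum(exp_scores)
    weights = [e / total for e in exp_scores]
    pooled = [
        sum(w * vec[i] for w, vec in zip(weights, script_embeddings))
        for i in range(dim)
    ]
    return pooled, weights


# Toy input: three scripts, each with a 2-dimensional embedding.
scripts = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
repo_vector, script_weights = attention_pool(scripts, query=[5.0, 5.0])
```

With an all-zero query every script receives the same weight, so `attention_pool` reduces to `mean_pool`; a non-zero query lets some scripts contribute more than others to the repository vector, which is the property a fixed average or concatenation lacks.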
