Code-based Language Models (LMs) have shown promising results in software engineering, with applications such as code refinement, code completion, and code generation. However, the task of classifying time and space complexity from code has not been extensively explored due to a lack of datasets, and prior work has been limited to Java. In this project, we address these gaps by creating a labelled dataset of code snippets spanning multiple languages (currently Python and C++, with C, C#, and JavaScript datasets to be released shortly). We find that existing time complexity calculation libraries and tools apply only to a limited number of use cases. The lack of a well-defined rule-based system motivates the application of several recently proposed code-based LMs. We demonstrate the effectiveness of dead code elimination and of increasing the maximum sequence length of LMs. In addition to time complexity, we propose using LMs to predict space complexity from code; to the best of our knowledge, this is the first attempt to do so. Furthermore, we introduce a novel code comprehension task, called cross-language transfer, in which we fine-tune the LM on one language and run inference on another. Finally, we visualize the activations of the attention-fed classification head of our LMs using Non-negative Matrix Factorization (NMF) to interpret our results.