Code-based language models (LMs) have shown promising results in software engineering, with applications such as code refinement, code completion, and code generation. However, classifying the time and space complexity of code has not been explored extensively because of a lack of datasets, and prior work has been limited to Java. In this project, we address these gaps by creating a labelled dataset of code snippets spanning multiple languages (currently Python and C++, with C, C#, and JavaScript datasets to be released shortly). We find that existing time complexity calculation libraries and tools apply to only a limited number of use cases. The lack of a well-defined rule-based system motivates the application of several recently proposed code-based LMs. We demonstrate the effectiveness of dead code elimination and of increasing the maximum sequence length of the LMs. In addition to time complexity, we propose using LMs to predict space complexity from code; to the best of our knowledge, this is the first attempt to do so. Furthermore, we introduce a novel code comprehension task, called cross-language transfer, in which we fine-tune the LM on one language and run inference on another. Finally, we use Non-negative Matrix Factorization (NMF) to visualize the activations of the attention-fed classification head of our LMs and interpret our results.
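The abstract does not say how dead code is eliminated before snippets are passed to the LMs. As a rough illustration of why the step can matter for complexity classification, the sketch below assumes one simple form of it for Python snippets: dropping statements that can never execute because they follow a `return`, `raise`, `break`, or `continue` in the same block. The names `UnreachableCodeRemover` and `eliminate_dead_code` are hypothetical and are not taken from the project; the code relies only on the standard-library `ast` module (Python 3.9+ for `ast.unparse`).

```python
import ast

# Statements after which nothing else in the same block can execute.
TERMINATORS = (ast.Return, ast.Raise, ast.Break, ast.Continue)


class UnreachableCodeRemover(ast.NodeTransformer):
    """Drop statements that follow a terminator within the same block."""

    @staticmethod
    def _prune(stmts):
        kept = []
        for stmt in stmts:
            kept.append(stmt)
            if isinstance(stmt, TERMINATORS):
                break  # everything after this statement is unreachable
        return kept

    def generic_visit(self, node):
        # Visit children first so nested blocks are pruned as well.
        super().generic_visit(node)
        for field in ("body", "orelse", "finalbody"):
            block = getattr(node, field, None)
            # Only statement lists are pruned (e.g. IfExp.body is an expression).
            if isinstance(block, list) and block and isinstance(block[0], ast.stmt):
                setattr(node, field, self._prune(block))
        return node


def eliminate_dead_code(source: str) -> str:
    """Return `source` with unreachable statements removed (hypothetical helper)."""
    tree = UnreachableCodeRemover().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


if __name__ == "__main__":
    snippet = (
        "def f(n):\n"
        "    for i in range(n):\n"
        "        pass\n"
        "    return n\n"
        "    for j in range(n):\n"      # unreachable nested loop
        "        for k in range(n):\n"
        "            pass\n"
    )
    print(eliminate_dead_code(snippet))
```

In this example the unreachable nested loop would suggest quadratic time to a reader (or a model) that only looks at surface structure, although the function actually runs in linear time; removing it leaves only the code that determines the true complexity. Other forms of dead code, such as unused functions or variables, are not handled by this sketch.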