CodeComplex: A Time-Complexity Dataset for Bilingual Source Codes

Analyzing the worst-case time complexity of code is a crucial task in computer science and software engineering for ensuring the efficiency, reliability, and robustness of software systems. However, it is well known that determining the worst-case time complexity of a given code written in a general-purpose programming language is theoretically undecidable, as a consequence of the halting problem, which Alan Turing proved undecidable. Thus, we move towards more realistic scenarios where the inputs and outputs of a program exist. This allows us to check the correctness of given codes, yet their time complexity remains challenging to analyze exhaustively. In response to this challenge, we introduce CodeComplex, a novel source code dataset in which each code is manually annotated with its worst-case time complexity. CodeComplex comprises 4,900 Java codes and an equal number of Python codes, all sourced from programming competitions and annotated with complexity labels by a panel of algorithmic experts. To the best of our knowledge, CodeComplex is the most extensive code dataset tailored for complexity prediction. We then present the results of our experiments with various baseline models built on state-of-the-art neural models for code comprehension, namely CodeBERT, GraphCodeBERT, UniXcoder, PLBART, CodeT5, CodeT5+, and ChatGPT. Finally, we analyze how the dataset affects the models' learning in predicting time complexity.
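To make the notion of a worst-case label concrete, here is a minimal Python sketch. It is not taken from CodeComplex: the function `has_duplicate` and its comparison counter are hypothetical, chosen only to show why a single test run cannot reveal the worst case. The same code finishes after one comparison on a favorable input and needs n(n-1)/2 comparisons on an unfavorable one, so an annotator would plausibly assign it a quadratic worst-case label.

```python
def has_duplicate(values):
    """Pairwise duplicate check; returns (found, number_of_comparisons)."""
    comparisons = 0
    n = len(values)
    for i in range(n):
        for j in range(i + 1, n):
            comparisons += 1
            if values[i] == values[j]:
                # Early exit: the running time depends on where the duplicate sits.
                return True, comparisons
    # No duplicate: every pair was compared, i.e. n * (n - 1) / 2 comparisons.
    return False, comparisons


# Favorable input: the first two elements already match.
best = has_duplicate([7, 7] + list(range(100)))

# Unfavorable input: 102 distinct values force all pairs to be compared.
worst = has_duplicate(list(range(102)))

print(best)   # (True, 1)
print(worst)  # (False, 5151)
```

Both inputs have 102 elements, yet the comparison counts differ by more than three orders of magnitude. This is why observing a program on given inputs and outputs can confirm its correctness but not its worst-case complexity.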

