In this paper, we explore the feasibility of finding algorithm implementations in code. Successfully matching code with algorithms can help developers understand unknown code, provide reference implementations, and automatically collect data for learning-based program synthesis. To this end, we designed a new language, p-language, for specifying algorithms, and a static analyzer for p-language that automatically extracts control flow, math, and natural language information from algorithm descriptions. We embedded the output of p-language (p-code) and source code in a common vector space using self-supervised machine learning methods, so that algorithms can be matched with code without any manual annotation. We developed a tool named Beryllium that takes pseudocode as a query and returns a ranked list of code snippets that likely match the algorithm query. Our evaluation on the Stony Brook Algorithm Repository and popular GitHub projects shows that Beryllium significantly outperformed state-of-the-art code search tools for both C and Java. Specifically, for 98.5%, 93.8%, and 66.2% of queries, we found the algorithm implementation within the top 25, top 10, and top 1 ranked results, respectively. Given 87 algorithm queries, we found implementations for 74 algorithms (about 85%) in GitHub projects whose algorithms we did not know beforehand.
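To make the retrieval setup and the top-k figures concrete, the following is a minimal sketch of ranking code snippets against a query in a shared vector space and scoring the result with a top-k hit rate. It is not Beryllium's implementation: the three-dimensional vectors, the file names, the use of cosine similarity, and the helper names `cosine`, `rank_snippets`, and `top_k_hit_rate` are all assumptions made for illustration. In the paper's setting the vectors would come from the learned embeddings of p-code and source code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_snippets(query_vec, snippet_vecs):
    """Return snippet ids ordered from most to least similar to the query."""
    return sorted(
        snippet_vecs,
        key=lambda sid: cosine(query_vec, snippet_vecs[sid]),
        reverse=True,
    )


def top_k_hit_rate(rankings, ground_truth, k):
    """Fraction of queries whose correct snippet appears in the first k results."""
    hits = sum(1 for q, ranked in rankings.items() if ground_truth[q] in ranked[:k])
    return hits / len(rankings)


# Hypothetical embeddings; real ones would be produced by the trained model.
SNIPPETS = {
    "bsearch.c": [0.9, 0.1, 0.0],
    "qsort.c": [0.1, 0.9, 0.1],
    "bfs.c": [0.0, 0.2, 0.9],
}
QUERIES = {
    "binary search": [1.0, 0.0, 0.1],
    "quicksort": [0.6, 0.5, 0.0],  # deliberately ambiguous toy query
    "breadth-first search": [0.1, 0.1, 1.0],
}
# Hypothetical ground truth: which snippet implements each queried algorithm.
TRUTH = {
    "binary search": "bsearch.c",
    "quicksort": "qsort.c",
    "breadth-first search": "bfs.c",
}

rankings = {q: rank_snippets(vec, SNIPPETS) for q, vec in QUERIES.items()}

for k in (1, 2):
    print(f"top-{k} hit rate: {top_k_hit_rate(rankings, TRUTH, k):.3f}")
```

In this toy data the ambiguous "quicksort" query ranks its correct snippet second, so the hit rate rises as k grows, which is the same effect behind the gap between the top 1 and top 25 figures reported above.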