In this paper, we explore the feasibility of finding algorithm implementations in code. Successfully matching code to algorithms can help developers understand unknown code, provide reference implementations, and automatically collect data for learning-based program synthesis. To achieve this goal, we designed a new language named p-language for specifying algorithms, along with a static analyzer for p-language that automatically extracts control flow, math, and natural language information from the algorithm descriptions. We embedded the output of p-language (p-code) and source code in a common vector space using self-supervised machine learning methods, so that algorithms can be matched with code without any manual annotation. We developed a tool named Beryllium, which takes pseudo code as a query and returns a ranked list of code snippets that likely match the algorithm query. Our evaluation on the Stony Brook Algorithm Repository and popular GitHub projects shows that Beryllium significantly outperformed state-of-the-art code search tools in both C and Java. Specifically, for 98.5%, 93.8%, and 66.2% of the queries, we found the algorithm implementation among the top 25, top 10, and top 1 ranked results, respectively. Given 87 algorithm queries, we found implementations for 74 algorithms in GitHub projects whose algorithms were not known to us beforehand.
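To make the retrieval step and the top-k figures concrete, the following is a minimal sketch of ranking snippets in a shared vector space and scoring the result. It is not Beryllium's implementation: the use of cosine similarity, the function names (`cosine`, `rank_snippets`, `top_k_hit_rate`), the toy two-dimensional vectors, and the assumption of exactly one correct snippet per query are all illustrative choices that the abstract does not specify.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_snippets(query_vec, snippet_vecs):
    """Return snippet ids ordered from most to least similar to the query.

    Assumption: similarity in the shared space is measured by cosine;
    the abstract does not state which measure is actually used.
    """
    return sorted(
        snippet_vecs,
        key=lambda sid: cosine(query_vec, snippet_vecs[sid]),
        reverse=True,
    )


def top_k_hit_rate(rankings, gold, k):
    """Fraction of queries whose correct snippet appears in the first k results.

    Assumption: each query has exactly one known correct snippet id in `gold`.
    """
    hits = sum(1 for q, ranked in rankings.items() if gold[q] in ranked[:k])
    return hits / len(rankings)


# Hypothetical embeddings; real ones would come from the learned models.
snippets = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
ranked = rank_snippets([1.0, 0.1], snippets)

rankings = {"q1": ranked, "q2": ["b", "a", "c"]}
gold = {"q1": "a", "q2": "a"}
top1 = top_k_hit_rate(rankings, gold, 1)
top2 = top_k_hit_rate(rankings, gold, 2)
```

Under these assumptions, a statement such as "66.2% of queries in the top 1" corresponds to `top_k_hit_rate(rankings, gold, 1)` computed over all evaluation queries.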