Using the Uniqueness of Global Identifiers to Determine the Provenance of Python Software Source Code

We consider the problem of identifying the provenance of free/open source software (FOSS), and specifically the need to identify where reused source code has been copied from. We propose a lightweight approach to this problem based on software identifiers, such as the names of variables, classes, and functions chosen by programmers. The approach efficiently narrows the search down to a small set of candidate origin products, which can then be analyzed with more expensive techniques to make a final provenance determination. By analyzing the PyPI (Python Package Index) open source ecosystem, we find that globally defined identifiers are highly distinctive. Across PyPI's 244 K packages we found 11.2 M different global identifiers (class names and method/function names, with only 0.6% of identifiers shared between the two types of entities); 76% of identifiers were used in only one package, and 93% in at most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to at most 3 products in 89% of cases. We validate the approach by mapping Debian source packages implemented in Python to the corresponding PyPI packages. The procedure uses at most five trials, where each trial takes three randomly chosen global identifiers from a randomly chosen Python file of the subject software package, ranks the results using a popularity index, and requires inspecting only the top result. In our experiments, this method finds the true origin of a project with a recall of 0.9 and a precision of 0.77.
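The lookup procedure described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. The function names (`global_identifiers`, `build_index`, `candidate_origins`), the in-memory index, the `max_freq` cutoff used to decide what counts as "non-frequent", and the choice to combine the sampled identifiers by intersecting their package sets are all assumptions made for illustration. The sketch also counts nested functions as global identifiers, which may differ from the definition used in the study.

```python
import ast
import random
from collections import defaultdict


def global_identifiers(source: str) -> set[str]:
    """Collect class names and function/method names defined in a module.

    Assumption: every ClassDef/FunctionDef counts, including nested ones.
    """
    names = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)):
            names.add(node.name)
    return names


def build_index(packages: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each identifier to the set of packages that define it."""
    index = defaultdict(set)
    for package, sources in packages.items():
        for source in sources:
            for name in global_identifiers(source):
                index[name].add(package)
    return index


def candidate_origins(files, index, popularity, k=3, trials=5, max_freq=10, rng=None):
    """Return candidate origin packages, most popular first.

    Each trial picks a random file, samples up to k non-frequent identifiers
    from it, and intersects the package sets of those identifiers.
    max_freq is a hypothetical threshold for "non-frequent".
    """
    if not files:
        return []
    rng = rng or random.Random()
    for _ in range(trials):
        source = rng.choice(files)
        names = sorted(
            n for n in global_identifiers(source)
            if 0 < len(index.get(n, ())) <= max_freq
        )
        if not names:
            continue  # nothing usable in this file, try another trial
        sample = rng.sample(names, min(k, len(names)))
        candidates = set.intersection(*(index[n] for n in sample))
        if candidates:
            # Rank by popularity (descending), breaking ties by name.
            return sorted(candidates, key=lambda p: (-popularity.get(p, 0), p))
    return []
```

With an index built from known packages, `candidate_origins(files, index, popularity)[:1]` gives the single top-ranked candidate to inspect, mirroring the "inspect only the top result" step. A real deployment would need a persistent index over the whole ecosystem instead of an in-memory dictionary.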
