We consider the problem of identifying the provenance of free/open source software (FOSS), and specifically the need to identify where reused source code has been copied from. We propose a lightweight approach to the problem based on software identifiers, such as the names of variables, classes, and functions chosen by programmers. The proposed approach efficiently narrows the search down to a small set of candidate origin products, which can then be analyzed with more expensive techniques to make a final provenance determination.

By analyzing the PyPI (Python Package Index) open source ecosystem, we find that globally defined identifiers are highly distinctive. Across PyPI's 244 K packages we found 11.2 M different global identifiers (class names and method/function names, with only 0.6% of identifiers shared between the two types of entities); 76% of identifiers were used in only one package, and 93% in at most 3. Randomly selecting 3 non-frequent global identifiers from an input product is enough to narrow down its origins to at most 3 products in 89% of cases.

We validate the proposed approach by mapping Debian source packages implemented in Python to the corresponding PyPI packages. The validation uses at most five trials, where each trial takes three randomly chosen global identifiers from a randomly chosen Python file of the subject software package; the results are then ranked using a popularity index, and only the top result needs to be inspected. In our experiments, this method finds the true origin of a project with a recall of 0.9 and a precision of 0.77.
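The trial procedure described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `index` (a mapping from identifier to the set of packages defining it), `popularity` (a mapping from package to a score), the `max_packages` cutoff used to decide what counts as "non-frequent", the choice to intersect the per-identifier package sets, and stopping at the first trial that yields candidates are all hypothetical choices made for the example.

```python
import ast
import random


def global_identifiers(source):
    """Collect names of top-level classes and functions, plus methods of top-level classes."""
    names = set()
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            names.add(node.name)
        elif isinstance(node, ast.ClassDef):
            names.add(node.name)
            for item in node.body:
                if isinstance(item, (ast.FunctionDef, ast.AsyncFunctionDef)):
                    names.add(item.name)
    return names


def candidate_origins(identifiers, index):
    """Return the packages that define every one of the given identifiers.

    Assumption: candidates are combined by set intersection.
    """
    sets = [index.get(name, set()) for name in identifiers]
    return set.intersection(*sets) if sets else set()


def guess_origin(files, index, popularity,
                 max_trials=5, sample_size=3, max_packages=100, rng=None):
    """Run up to max_trials trials and return the most popular candidate, or None.

    files: list of Python source strings from the subject package.
    max_packages: hypothetical threshold for "non-frequent" identifiers.
    """
    if not files:
        return None
    rng = rng or random.Random()
    for _ in range(max_trials):
        source = rng.choice(files)
        # Keep only identifiers known to the index and not too widespread.
        rare = sorted(
            name for name in global_identifiers(source)
            if 0 < len(index.get(name, ())) <= max_packages
        )
        if len(rare) < sample_size:
            continue  # this file has too few usable identifiers; try another trial
        picked = rng.sample(rare, sample_size)
        candidates = candidate_origins(picked, index)
        if candidates:
            # Rank by popularity (ties broken by name) and keep only the top result.
            return max(candidates, key=lambda pkg: (popularity.get(pkg, 0), pkg))
    return None
```

In this sketch, a caller would build `index` once from a corpus of packages by running `global_identifiers` over each package's files, then call `guess_origin` with the source files of the package whose origin is unknown. The returned package name is only a candidate, to be confirmed with a more expensive comparison.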