Cross-ecosystem categorization: A manual-curation protocol for the categorization of Java Maven libraries along Python PyPI Topics

Context: Software in different functional categories, such as text processing vs. networking, has different profiles in terms of metrics like security and updates. Using popularity to compare, e.g., Java vs. Python libraries might give a skewed perspective, as the categories of the most popular software vary from one ecosystem to the next. How can one compare library datasets across software ecosystems when not even the category names are uniform among them? Objective: We study how to generate a language-agnostic categorisation of software by functional purpose that enables cross-ecosystem studies of library datasets. This provides the functional fingerprint information needed for software metrics comparisons. Method: We designed and implemented a human-guided protocol to categorise libraries from software ecosystems. Category names mirror PyPI Topic classifiers, but the protocol is generic and can be applied to any ecosystem. We demonstrate it by categorising 256 Java/Maven libraries with severe security vulnerabilities. Results: The protocol allows three or more people to categorise any number of libraries. The resulting categorisation is function-oriented and language-agnostic. In the Java/Maven demonstration, the majority of libraries were Internet-oriented, consistent with the dataset's selection by severe vulnerabilities. To allow replication and updates, we make the dataset and the individual protocol steps available as open data. Conclusions: Categorising libraries by functional purpose is feasible with our protocol, which produced the fingerprint of a 256-library Java dataset. While this was labour-intensive, humans excel at the required inference tasks, so full automation of the process is not envisioned. However, the results can provide the ground truth needed for machine learning in large-scale cross-ecosystem empirical studies.
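
As an illustration of what a "functional fingerprint" could look like in practice, the sketch below assumes it is simply the share of libraries assigned to each top-level PyPI Topic. The abstract does not define the fingerprint in these terms. The function name `fingerprint`, the library coordinates, and the labels are hypothetical and are not taken from the published dataset.

```python
from collections import Counter


def fingerprint(labels):
    """Return the share of libraries per top-level PyPI Topic, most common first.

    `labels` maps a library identifier to one PyPI Topic classifier string,
    e.g. "Topic :: Internet :: WWW/HTTP". This is an assumed representation.
    """
    # Keep only the first level below "Topic" (e.g. "Internet").
    top_level = [topic.split(" :: ")[1] for topic in labels.values()]
    counts = Counter(top_level)
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.most_common()}


# Hypothetical labels for four made-up Maven coordinates.
labels = {
    "org.example:http-client": "Topic :: Internet :: WWW/HTTP",
    "org.example:web-server": "Topic :: Internet",
    "org.example:xml-parser": "Topic :: Text Processing :: Markup :: XML",
    "org.example:crypto-utils": "Topic :: Security :: Cryptography",
}

print(fingerprint(labels))
```

Because the category names are ecosystem-independent, fingerprints computed this way for a Java dataset and a Python dataset could be compared topic by topic.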
