A Dataset of Android Libraries

Android app developers rely extensively on code reuse, integrating many third-party libraries into their apps. While such integration is practical for developers, it makes it challenging for static analyzers to achieve scalability and precision, since libraries can account for a large part of the app code. As a direct consequence, it is common practice in the literature to run static analysis on developer code only, on the assumption that the sought issues lie in developer code rather than in the libraries. Analysts therefore need to precisely distinguish library code from developer code in Android apps to ensure the effectiveness of static analysis. Currently, many static analysis approaches rely on white lists of libraries. However, these white lists are unreliable, as they are inaccurate and far from comprehensive. In this paper, we address the lack of comprehensive and automated solutions for producing accurate and "always up to date" sets of third-party libraries. First, we demonstrate the continued need for a white list of third-party libraries. Second, we propose an automated approach to produce an accurate and up-to-date set of third-party libraries in the form of a dataset called AndroLibZoo. Our dataset, which we make available to the research community, currently contains 20,162 libraries and is meant to evolve. Third, we illustrate the significance of using AndroLibZoo to filter libraries in recent apps. Fourth, we demonstrate that AndroLibZoo is more suitable than the current state-of-the-art list for improved static analysis. Finally, we show how the use of AndroLibZoo can enhance the performance of existing Android app static analyzers.
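
To make the white-list filtering idea concrete, the following is a minimal sketch of how an analyzer might separate developer classes from library classes before running an analysis. It assumes the white list is available as a set of library package names; this representation, and the names `is_library_class` and `split_app_classes`, are hypothetical and are not taken from AndroLibZoo itself.

```python
from typing import Iterable, List, Set, Tuple


def is_library_class(class_name: str, library_packages: Set[str]) -> bool:
    """Return True if any enclosing package of class_name is white-listed.

    Assumption: library_packages holds dotted package names such as
    "com.google.gson"; the real dataset format may differ.
    """
    parts = class_name.split(".")
    # Test every package prefix of the class, shortest first, comparing
    # whole package segments (so "com.googlex" never matches "com.google").
    for i in range(1, len(parts)):
        if ".".join(parts[:i]) in library_packages:
            return True
    return False


def split_app_classes(
    class_names: Iterable[str], library_packages: Set[str]
) -> Tuple[List[str], List[str]]:
    """Partition class names into (developer_classes, library_classes)."""
    developer: List[str] = []
    library: List[str] = []
    for name in class_names:
        if is_library_class(name, library_packages):
            library.append(name)
        else:
            developer.append(name)
    return developer, library


if __name__ == "__main__":
    # Hypothetical white list and app classes, for illustration only.
    white_list = {"com.google.gson", "okhttp3"}
    app_classes = [
        "com.example.app.MainActivity",
        "com.google.gson.Gson",
        "okhttp3.internal.Util",
        "com.googlex.Foo",
    ]
    dev, lib = split_app_classes(app_classes, white_list)
    print("developer:", dev)
    print("library:", lib)
```

Matching on whole package segments rather than raw string prefixes avoids misclassifying developer packages whose names merely begin with the same characters as a library package. A static analyzer could then restrict its analysis to the developer classes only.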
