LibAM: An Area Matching Framework for Detecting Third-party Libraries in Binaries

Third-party libraries (TPLs) are extensively used by developers to speed up software development and incorporate external functionality. Nevertheless, insecure TPL reuse can lead to significant security risks. Existing methods determine whether TPL code is present in a target binary by extracting strings or matching functions. However, these methods often yield unsatisfactory results because strings recur and many functions are similar without being homologous. Additionally, they struggle to identify the specific pieces of reused code in the target binary, which complicates the detection of complex reuse relationships and impedes downstream tasks. In this paper, we observe that TPL reuse typically involves not just isolated functions but also areas made up of several adjacent functions on the Function Call Graph (FCG). We introduce LibAM, a novel Area Matching framework that connects isolated functions into function areas on the FCG and detects TPLs by comparing the similarity of these function areas. Furthermore, LibAM is the first approach capable of detecting the exact reuse areas on the FCG, which offers substantial benefits for downstream tasks. Experimental results demonstrate that LibAM outperforms all existing TPL detection methods and provides interpretable evidence for TPL detection results by identifying exact reuse areas. We also evaluate LibAM's accuracy on large-scale, real-world binaries in IoT firmware and generate a list of potential vulnerabilities for these devices. Finally, by analyzing the detection results on IoT firmware, we make several interesting findings, for example that different target binaries tend to reuse the same code area of a TPL.
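To make the area idea concrete, the following toy sketch grows a single matched function pair into a matched area by following call edges in both FCGs at once. It is only an illustration under stated assumptions, not LibAM's actual algorithm: the function name `grow_area`, the dictionary-based graph encoding, the example function names, and the precomputed `similar` set (standing in for the output of some function-level matcher) are all hypothetical.

```python
from collections import deque


def grow_area(target_fcg, tpl_fcg, similar, seed):
    """Expand one similar function pair into a matched area (toy sketch).

    target_fcg, tpl_fcg: dict mapping a function name to a list of its callees.
    similar: set of (target_func, tpl_func) pairs judged similar by some
             function-level matcher (assumed to be given).
    seed: one pair from `similar` to start from.
    Returns the set of pairs reachable from the seed by following call
    edges in both graphs simultaneously, with each function used at most once.
    """
    area = {seed}
    used_target, used_tpl = {seed[0]}, {seed[1]}
    queue = deque([seed])
    while queue:
        t_func, l_func = queue.popleft()
        for t_callee in target_fcg.get(t_func, []):
            if t_callee in used_target:
                continue
            for l_callee in tpl_fcg.get(l_func, []):
                if l_callee in used_tpl:
                    continue
                if (t_callee, l_callee) in similar:
                    area.add((t_callee, l_callee))
                    used_target.add(t_callee)
                    used_tpl.add(l_callee)
                    queue.append((t_callee, l_callee))
                    break  # t_callee is matched; move to the next callee
    return area


# Hypothetical call graphs: the target binary and a candidate TPL.
target_fcg = {
    "main": ["parse", "log"],
    "parse": ["read_chunk", "crc"],
    "log": [],
    "read_chunk": [],
    "crc": [],
}
tpl_fcg = {
    "png_read": ["png_chunk", "png_crc"],
    "png_chunk": [],
    "png_crc": [],
}
# Similar pairs from a hypothetical function matcher. ("log", "png_crc")
# plays the role of a similar but non-homologous function.
similar = {
    ("parse", "png_read"),
    ("read_chunk", "png_chunk"),
    ("crc", "png_crc"),
    ("log", "png_crc"),
}

reused = grow_area(target_fcg, tpl_fcg, similar, ("parse", "png_read"))
isolated = grow_area(target_fcg, tpl_fcg, similar, ("log", "png_crc"))
```

In this toy setup the pair `("parse", "png_read")` grows into a three-function area because its callees also match, while the pair `("log", "png_crc")` stays isolated. Filtering on area size rather than on individual matches is one simple way to discount isolated look-alike functions, and the surviving area doubles as evidence of which code was reused.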
