Combining AI and AM - Improving Approximate Matching through Transformer Networks

Approximate matching (AM) is a concept in digital forensics for determining the similarity between digital artifacts. An important use case of AM is the reliable and efficient detection of case-relevant data structures on a blacklist when only fragments of the original are available. For instance, if only a cluster of indexed malware is still present during a digital forensic investigation, the AM algorithm should be able to assign that fragment to the blacklisted malware. However, traditional AM functions such as TLSH and ssdeep fail to detect files from their fragments if the fragment is small relative to the overall file size. A second well-known issue with traditional AM algorithms is that they scale poorly as lookup databases keep growing. We propose an improved matching algorithm based on transformer models from the field of natural language processing, which we call Deep Learning Approximate Matching (DLAM). As an approach from artificial intelligence (AI), DLAM learns the characteristic patterns of blacklisted data during its training phase. It can then detect these patterns in a typically much larger file; that is, DLAM focuses on the use case of fragment detection. We show that DLAM has three key advantages over the prominent conventional approaches TLSH and ssdeep. First, it makes obsolete the tedious extraction of known-bad parts, which has so far been necessary before AM algorithms can search for them. This allows efficient classification of files on a much larger scale, which is important given the exponentially increasing volume of data to be investigated. Second, depending on the use case, DLAM achieves similar or even significantly higher accuracy in recovering fragments of blacklisted files. Third, we show that DLAM enables the detection of file correlations in the output of TLSH and ssdeep even for small fragment sizes.
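The fragment-detection problem described above can be illustrated with a toy sketch. It is not how TLSH, ssdeep, or DLAM work internally. It only shows, under simplifying assumptions, why a symmetric similarity score drops when a fragment is small relative to its parent file, while an asymmetric containment score does not. The byte n-gram representation, the function names, and the file and fragment sizes are all hypothetical choices made for illustration.

```python
import os


def ngrams(data: bytes, n: int = 8) -> set:
    """Return the set of all overlapping byte n-grams in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard(a: set, b: set) -> float:
    """Symmetric score: shared n-grams divided by n-grams in either input."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def containment(fragment: set, whole: set) -> float:
    """Asymmetric score: share of the fragment's n-grams found in the whole."""
    return len(fragment & whole) / len(fragment) if fragment else 0.0


# Hypothetical data: a 256 KiB random "file" and one 4 KiB cluster cut from it.
file_bytes = os.urandom(256 * 1024)
fragment_bytes = file_bytes[4096 * 10:4096 * 11]

file_grams = ngrams(file_bytes)
fragment_grams = ngrams(fragment_bytes)

# The symmetric score is dominated by the size mismatch between the inputs.
symmetric_score = jaccard(fragment_grams, file_grams)
# The asymmetric score only asks whether the fragment occurs in the file.
asymmetric_score = containment(fragment_grams, file_grams)

print(f"symmetric (Jaccard) score: {symmetric_score:.4f}")
print(f"asymmetric (containment) score: {asymmetric_score:.4f}")
```

In this sketch the containment score is 1.0 because every n-gram of the fragment also occurs in the file, whereas the Jaccard score is roughly the ratio of fragment size to file size. A fragment detector in the sense of the abstract has to answer the asymmetric question.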
