A Unified Evaluation of Learning-Based Similarity Techniques for Malware Detection

Cryptographic digests (e.g., MD5, SHA-256) are designed to provide exact identity. Any single-bit change in the input produces a completely different hash, which is ideal for integrity verification but limits their usefulness in real-world tasks such as threat hunting, malware analysis, and digital forensics, where adversaries routinely introduce minor transformations. Similarity-based techniques address this limitation by enabling approximate matching, so that related byte sequences produce measurably similar fingerprints. Modern enterprises manage tens of thousands of endpoints holding billions of files, which makes the effectiveness and scalability of such techniques more important than ever in security applications. Security researchers have proposed a range of approaches, including similarity digests and locality-sensitive hashes (e.g., ssdeep, sdhash, TLSH), as well as more recent machine-learning-based methods that generate embeddings from file features. However, these techniques have largely been evaluated in isolation, using disparate datasets and evaluation criteria. This paper presents a systematic comparison of learning-based classification and similarity methods using large, publicly available datasets. We evaluate each method under a unified experimental framework with industry-accepted metrics. To our knowledge, this is the first reproducible study to benchmark these diverse learning-based similarity techniques side by side for real-world security workloads. Our results show that no single approach performs well across all dimensions; instead, each exhibits distinct trade-offs, indicating that effective malware analysis and threat-hunting platforms must combine complementary classification and similarity techniques rather than rely on a single method.
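The contrast between exact-identity digests and approximate matching can be illustrated with a minimal sketch. It uses only the Python standard library, and the similarity measure is a toy Jaccard score over byte 4-grams chosen purely for illustration; it is not ssdeep, sdhash, TLSH, or any method evaluated in this paper. The sample bytes and the helper names (`sha256_hex`, `ngram_set`, `jaccard_similarity`) are likewise assumptions made for this example.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def ngram_set(data: bytes, n: int = 4) -> set:
    """Collect the distinct byte n-grams that occur in data."""
    return {data[i:i + n] for i in range(len(data) - n + 1)}


def jaccard_similarity(a: bytes, b: bytes, n: int = 4) -> float:
    """Toy similarity score: Jaccard index of the two n-gram sets."""
    grams_a, grams_b = ngram_set(a, n), ngram_set(b, n)
    if not grams_a and not grams_b:
        return 1.0  # two empty inputs are treated as identical
    return len(grams_a & grams_b) / len(grams_a | grams_b)


# Hypothetical stand-in for file contents: 1024 bytes.
original = bytes(range(256)) * 4

# Flip a single bit in one byte to mimic a minor transformation.
tampered = bytearray(original)
tampered[100] ^= 0x01
modified = bytes(tampered)

digest_a = sha256_hex(original)
digest_b = sha256_hex(modified)
differing = sum(1 for x, y in zip(digest_a, digest_b) if x != y)

# The cryptographic digests no longer match at all...
print("SHA-256 digests equal:", digest_a == digest_b)
print("hex positions that differ:", differing, "of", len(digest_a))

# ...while the n-gram score stays close to 1.0.
print("4-gram Jaccard similarity:", round(jaccard_similarity(original, modified), 3))
```

A set-based score like this is far too expensive and too easy to evade for real workloads; it only shows why a fingerprint that degrades gradually under small edits is useful where an exact digest is not.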
