On the Abuse and Detection of Polyglot Files

A polyglot is a file that is valid in two or more formats. Polyglot files pose a problem for malware detection systems that route files to format-specific detectors/signatures, as well as for file upload and sanitization tools. In this work we found that existing file-format and embedded-file detection tools, even those developed specifically for polyglot files, fail to reliably detect polyglot files used in the wild, leaving organizations vulnerable to attack. To address this issue, we studied the use of polyglot files by malicious actors in the wild, finding $30$ polyglot samples and $15$ attack chains that leveraged polyglot files. In this report, we highlight two well-known APTs whose attack chains relied on polyglot files to bypass detection mechanisms. Using knowledge from our survey of polyglot usage in the wild (the first of its kind), we created a novel data set based on adversary techniques. We then trained a machine learning detection solution, PolyConv, on this data set. PolyConv achieves a precision-recall area-under-curve score of $0.999$, with an F1 score of $99.20$% for polyglot detection and $99.47$% for file-format identification, significantly outperforming all other tools tested. We also developed a content disarmament and reconstruction tool, ImSan, that successfully sanitized $100$% of the tested image-based polyglots, which were the most common type found in the survey. Our work provides concrete tools and suggestions to help defenders protect themselves against polyglot files, as well as directions for future work on more robust file specifications and methods of disarmament.
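
To make the definition of a polyglot concrete, the sketch below builds a tiny PNG image, appends a ZIP archive to it, and flags the result with a simple structural check. It is an illustration only, written under stated assumptions: it is not PolyConv or ImSan, it covers just one image-plus-archive layout, and every function name (`tiny_png`, `tiny_zip`, `bytes_after_png_end`, `looks_like_png_zip_polyglot`) is hypothetical.

```python
import io
import struct
import zipfile
import zlib

PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"


def png_chunk(kind: bytes, payload: bytes) -> bytes:
    """Serialize one PNG chunk: length, type, payload, CRC-32."""
    crc = zlib.crc32(kind + payload) & 0xFFFFFFFF
    return struct.pack(">I", len(payload)) + kind + payload + struct.pack(">I", crc)


def tiny_png() -> bytes:
    """Build a minimal 1x1, 8-bit grayscale PNG in memory (illustrative helper)."""
    ihdr = struct.pack(">IIBBBBB", 1, 1, 8, 0, 0, 0, 0)
    idat = zlib.compress(b"\x00\x00")  # one filter byte plus one pixel
    return (
        PNG_SIGNATURE
        + png_chunk(b"IHDR", ihdr)
        + png_chunk(b"IDAT", idat)
        + png_chunk(b"IEND", b"")
    )


def tiny_zip() -> bytes:
    """Build a small ZIP archive in memory (illustrative helper)."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("payload.txt", "hello")
    return buf.getvalue()


def bytes_after_png_end(data: bytes) -> bytes:
    """Walk the PNG chunk list and return whatever follows the IEND chunk."""
    if not data.startswith(PNG_SIGNATURE):
        raise ValueError("not a PNG")
    pos = len(PNG_SIGNATURE)
    while pos + 8 <= len(data):
        (length,) = struct.unpack(">I", data[pos:pos + 4])
        kind = data[pos + 4:pos + 8]
        pos += 8 + length + 4  # header, payload, CRC
        if kind == b"IEND":
            return data[pos:]
    raise ValueError("no IEND chunk found")


def looks_like_png_zip_polyglot(data: bytes) -> bool:
    """Heuristic: a PNG with trailing bytes that also parses as a ZIP archive."""
    try:
        trailing = bytes_after_png_end(data)
    except ValueError:
        return False
    return bool(trailing) and zipfile.is_zipfile(io.BytesIO(data))


if __name__ == "__main__":
    clean = tiny_png()
    polyglot = clean + tiny_zip()
    print("clean PNG flagged:", looks_like_png_zip_polyglot(clean))
    print("PNG+ZIP flagged:  ", looks_like_png_zip_polyglot(polyglot))
```

The combined file works because PNG readers stop at the `IEND` chunk while ZIP readers locate the archive from the end of the file, so each parser accepts the same bytes. A hand-written check like this only recognizes the one layout it was written for, which is the kind of limitation that motivates a learned detector.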
