CIC-Trap4Phish: A Unified Multi-Format Dataset for Phishing and Quishing Attachment Detection

Phishing is one of the primary attack methods used by cyber attackers. In many cases, attackers use deceptive emails with malicious attachments to trick users into giving away sensitive information or installing malware, which can compromise entire systems. The flexibility of malicious email attachments makes them a preferred vector for attackers, as they can embed harmful content such as malware or malicious URLs inside standard document formats. Although phishing email defenses have improved considerably, attackers continue to abuse attachments, enabling malicious content to bypass security measures. Another challenge that researchers face in training advanced models is the lack of a unified and comprehensive dataset that covers the most prevalent data types. To address this gap, we generated CIC-Trap4Phish, a multi-format dataset containing both malicious and benign samples across five categories commonly used in phishing campaigns: Microsoft Word documents, Excel spreadsheets, PDF files, HTML pages, and QR code images. For the first four file types, a set of execution-free static feature pipelines was proposed, designed to capture structural, lexical, and metadata-based indicators without the need to open or execute files. Feature selection was performed using a combination of SHAP analysis and feature importance, yielding compact, discriminative feature subsets for each file type. The selected features were evaluated using lightweight machine learning models, including Random Forest, XGBoost, and Decision Tree. All models demonstrate high detection accuracy across formats. For QR code-based phishing (quishing), two complementary methods were implemented: image-based detection employing Convolutional Neural Networks (CNNs) and lexical analysis of decoded URLs using recent lightweight language models.
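
To make the idea of execution-free static feature extraction concrete, the following is a minimal sketch for the HTML case, using only the Python standard library. The feature names (`script`, `form`, `iframe`, `password_input`, `absolute_link`, `length`) and the helper `extract_html_features` are illustrative assumptions and are not the feature set or code used for CIC-Trap4Phish. The markup is only tokenized; nothing is rendered or executed.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse


class StaticHtmlFeatures(HTMLParser):
    """Counts a few structural indicators while tokenizing the markup.

    Hypothetical feature set, chosen for illustration only.
    """

    def __init__(self):
        super().__init__()
        self.counts = {"script": 0, "form": 0, "iframe": 0,
                       "password_input": 0, "absolute_link": 0}

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names; values may be None.
        attrs = dict(attrs)
        if tag in ("script", "form", "iframe"):
            self.counts[tag] += 1
        elif tag == "input" and (attrs.get("type") or "").lower() == "password":
            self.counts["password_input"] += 1
        elif tag == "a" and urlparse(attrs.get("href") or "").netloc:
            # An href with a host part is an absolute (possibly off-site) link.
            self.counts["absolute_link"] += 1


def extract_html_features(html_text):
    """Return a dict of static features for one HTML document (as text)."""
    parser = StaticHtmlFeatures()
    parser.feed(html_text)
    parser.close()
    features = dict(parser.counts)
    features["length"] = len(html_text)  # simple size-based indicator
    return features


if __name__ == "__main__":
    sample = ('<html><body><form action="https://evil.example/post">'
              '<input type="password" name="p"></form>'
              '<a href="https://evil.example/x">a</a><a href="/local">b</a>'
              '<script>alert(1)</script></body></html>')
    print(extract_html_features(sample))
```

A feature dictionary of this kind could then be fed to a lightweight classifier such as a Decision Tree or Random Forest; which features are actually discriminative would have to be established by a selection step like the one described above.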
