Phishing websites now rely heavily on visual imitation (copied logos, similar layouts, and matching colours) to evade detection by text- and URL-based systems. This paper presents a deep learning approach that uses webpage screenshots for image-based phishing detection. Two vision models, ConvNeXt-Tiny and Vision Transformer (ViT-Base), were evaluated on their ability to handle visually deceptive phishing pages. The framework covers dataset creation, preprocessing, transfer learning from ImageNet weights, and evaluation across different decision thresholds. The results show that ConvNeXt-Tiny performs best overall, achieving the highest F1-score at the optimised threshold while running more efficiently than ViT-Base. This highlights the strength of convolutional models for visual phishing detection and underscores the importance of threshold tuning for real-world deployment. As future work, the curated dataset used in this study will be released to support reproducibility and encourage further research in this area. Unlike many existing studies that primarily report accuracy, this work emphasises threshold-aware evaluation to better reflect real-world deployment conditions. By examining precision, recall, and F1-score across different decision thresholds, the study identifies operating points that balance detection performance against false-alarm control. In addition, the side-by-side comparison of ConvNeXt-Tiny and ViT-Base under the same experimental setup offers practical insights into how convolutional and transformer-based architectures differ in robustness and computational efficiency for visual phishing detection.
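The threshold-aware evaluation described above can be sketched as a simple sweep over candidate decision thresholds, selecting the operating point that maximises F1. The sketch below is a minimal illustration only; the scores and labels are hypothetical and do not come from the paper's dataset or models.

```python
def f1_at_threshold(scores, labels, threshold):
    """Compute precision, recall, and F1 when classifying score >= threshold as phishing (1)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

def best_threshold(scores, labels, candidates):
    """Sweep candidate thresholds and return the one maximising F1."""
    return max(candidates, key=lambda t: f1_at_threshold(scores, labels, t)[2])

# Hypothetical model scores (probability of "phishing") and ground-truth labels.
scores = [0.95, 0.80, 0.62, 0.40, 0.30, 0.10]
labels = [1,    1,    1,    0,    1,    0]
t = best_threshold(scores, labels, [i / 100 for i in range(5, 100, 5)])
```

In practice such a sweep is run on a held-out validation set, and the selected threshold is then fixed before reporting test-set precision, recall, and F1, which is what distinguishes a threshold-aware evaluation from reporting accuracy at a default cutoff of 0.5.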