Phishing remains a pervasive and growing threat, inflicting heavy economic and reputational damage. While machine learning has been effective in real-time detection of phishing attacks, progress is hindered by a lack of large, high-quality datasets and benchmarks. In addition to poor quality stemming from challenges in data collection, existing datasets suffer from leakage and unrealistic base rates, leading to overly optimistic performance results. In this paper, we introduce PhreshPhish, a large-scale, high-quality dataset of phishing websites that addresses these limitations. Compared to existing public datasets, PhreshPhish is substantially larger and provides significantly higher quality, as measured by the estimated rate of invalid or mislabeled data points. Additionally, we propose a comprehensive suite of benchmark datasets specifically designed for realistic model evaluation by minimizing leakage, increasing task difficulty, enhancing dataset diversity, and adjusting base rates to levels more likely to be seen in the real world. We train and evaluate multiple solution approaches to provide baseline performance on the benchmark sets. We believe the availability of this dataset and these benchmarks will enable realistic, standardized model comparison and foster further advances in phishing detection. The datasets and benchmarks are available on Hugging Face (https://huggingface.co/datasets/phreshphish/phreshphish).
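To illustrate why unrealistic base rates can inflate reported results, the following minimal sketch computes the expected precision of a detector at different phishing base rates. The function name and the true-positive and false-positive rates are hypothetical values chosen for illustration; they are not results from PhreshPhish or its baselines.

```python
def precision_at_base_rate(tpr: float, fpr: float, base_rate: float) -> float:
    """Expected precision of a detector with fixed TPR and FPR.

    base_rate is the fraction of evaluated pages that are phishing.
    """
    if not 0.0 <= base_rate <= 1.0:
        raise ValueError("base_rate must be in [0, 1]")
    true_pos = tpr * base_rate            # phishing pages correctly flagged
    false_pos = fpr * (1.0 - base_rate)   # benign pages wrongly flagged
    flagged = true_pos + false_pos
    if flagged == 0.0:
        return 0.0                        # nothing flagged; precision undefined, report 0
    return true_pos / flagged


if __name__ == "__main__":
    # Hypothetical detector: 95% TPR, 1% FPR (illustrative numbers only).
    for rate in (0.5, 0.05, 0.001):
        p = precision_at_base_rate(0.95, 0.01, rate)
        print(f"base rate {rate:>6.3f} -> precision {p:.3f}")
```

With these assumed rates, precision is about 0.99 on a balanced test set but falls to about 0.09 when only 0.1% of pages are phishing, which is the kind of gap that base-rate-adjusted benchmarks are meant to expose.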