Phishing websites distribute unsolicited content and are frequently used to commit email and internet fraud, so detecting them before a user submits any information is critical. Several approaches to detecting phishing websites have been proposed in recent years. Most of them train classification models on hand-crafted lexical and statistical features extracted from a website's textual content. However, these approaches face several challenges, including 1) extracting hand-crafted features is tedious and requires specialized domain knowledge to determine which features are useful for a particular platform; and 2) models built on hand-crafted features have difficulty capturing the semantic patterns in the words and characters of URL and HTML content. To address these challenges, this paper proposes WebPhish, an end-to-end deep neural network trained on embedded raw URLs and HTML content to detect website phishing attacks. First, the model automatically applies an embedding technique that maps the characters of each input to corresponding dense vectors. Then, a concatenation layer merges the URL and HTML embedding matrices. Finally, convolutional layers model the semantic dependencies in the merged representation. Extensive experiments on real-world phishing data yielded an accuracy of 98.1%, showing that WebPhish outperforms baseline detection approaches in identifying phishing pages.
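The three stages named in the abstract (character embedding, concatenation of the URL and HTML embeddings, and convolution) can be illustrated with a minimal pure-Python forward pass. This is a sketch under stated assumptions, not the WebPhish implementation: the vocabulary, the sequence lengths, the embedding size, the single convolution filter, the shared embedding table, concatenation along the sequence axis, and the names `encode`, `init_params`, and `forward` are all hypothetical, and the weights are random rather than trained.

```python
import math
import random
import string

# Hypothetical hyperparameters; the abstract does not state the actual values.
VOCAB = string.ascii_lowercase + string.digits + ":/.-_?=&%<># "
EMBED_DIM = 4   # size of each character vector
URL_LEN = 16    # URLs are padded or truncated to this many characters
HTML_LEN = 32   # HTML is padded or truncated to this many characters
KERNEL = 3      # width of the 1D convolution window

# Id 0 is reserved for padding and id 1 for characters outside VOCAB.
CHAR_TO_ID = {ch: i + 2 for i, ch in enumerate(VOCAB)}


def encode(text, max_len):
    """Map raw characters to integer ids, then pad or truncate to max_len."""
    ids = [CHAR_TO_ID.get(ch, 1) for ch in text.lower()[:max_len]]
    return ids + [0] * (max_len - len(ids))


def init_params(seed=0):
    """Return randomly initialised (untrained) weights; training is out of scope."""
    rng = random.Random(seed)
    vocab_size = len(VOCAB) + 2

    def rand_matrix(rows, cols):
        return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

    return {
        "embed": rand_matrix(vocab_size, EMBED_DIM),  # assumed shared by URL and HTML
        "kernel": rand_matrix(KERNEL, EMBED_DIM),     # one convolution filter
        "out_w": rng.uniform(-0.5, 0.5),
        "out_b": 0.0,
    }


def forward(url, html, params):
    """Embed both inputs, concatenate them, convolve, pool, and score."""
    url_emb = [params["embed"][i] for i in encode(url, URL_LEN)]
    html_emb = [params["embed"][i] for i in encode(html, HTML_LEN)]
    merged = url_emb + html_emb  # assumed: concatenation along the sequence axis

    # Slide one 1D convolution filter over the merged sequence, with ReLU.
    feature_map = []
    for start in range(len(merged) - KERNEL + 1):
        window = merged[start:start + KERNEL]
        total = sum(
            w * x
            for kernel_row, char_vec in zip(params["kernel"], window)
            for w, x in zip(kernel_row, char_vec)
        )
        feature_map.append(max(0.0, total))

    pooled = max(feature_map)  # global max pooling over positions
    logit = params["out_w"] * pooled + params["out_b"]
    score = 1.0 / (1.0 + math.exp(-logit))  # sigmoid: phishing probability
    return merged, feature_map, score


if __name__ == "__main__":
    params = init_params()
    _, _, score = forward("http://example.com/login", "<html><form></form></html>", params)
    print(f"untrained phishing score: {score:.3f}")
```

Because the weights are untrained, the printed score carries no meaning; the sketch only shows how raw characters flow through the embedding, concatenation, and convolution stages without any hand-crafted features. A practical model would use a deep learning framework, many filters, and learned weights.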