Phishing attacks remain among the most prevalent cybersecurity threats, causing significant financial losses for individuals and organizations worldwide. This paper presents a machine learning-based phishing email detection system that analyzes email body content using natural language processing (NLP) techniques. Unlike existing approaches that primarily focus on URL analysis, our system classifies emails by extracting contextual features from the entire email content. We evaluated two classification models, Naive Bayes and Logistic Regression, trained on a combined corpus of 53,973 labeled emails from three distinct datasets. Our preprocessing pipeline incorporates lowercasing, tokenization, stop-word removal, and lemmatization, followed by Term Frequency-Inverse Document Frequency (TF-IDF) feature extraction with unigrams and bigrams. Experimental results demonstrate that Logistic Regression achieves 95.41% accuracy with an F1-score of 94.33%, outperforming Naive Bayes by 1.55 percentage points. The system was deployed as a web application with a FastAPI backend, providing real-time phishing classification with an average response time of 127 ms.
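The preprocessing and feature-extraction steps described above might be sketched as follows. This is a minimal standard-library illustration under stated assumptions, not the system's actual implementation: the stop-word list, the function names (`preprocess`, `ngrams`, `tfidf`), and the sample emails are hypothetical, lemmatization is omitted because it requires an external library, and the smoothed IDF formula is one common variant chosen for the example.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; a real system would use a fuller one).
STOP_WORDS = {"a", "an", "the", "to", "of", "and", "is", "your", "for", "in", "on", "at"}


def preprocess(text):
    """Lowercase, tokenize on alphanumeric runs, and drop stop words.

    Lemmatization is omitted here because it needs an external library.
    """
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]


def ngrams(tokens):
    """Return unigrams plus bigrams (adjacent token pairs joined by a space)."""
    bigrams = [f"{a} {b}" for a, b in zip(tokens, tokens[1:])]
    return tokens + bigrams


def tfidf(docs):
    """Compute TF-IDF weights per document.

    Uses a smoothed IDF (an assumed variant): idf(t) = ln((1 + N) / (1 + df(t))) + 1.
    """
    term_lists = [ngrams(preprocess(d)) for d in docs]
    n_docs = len(term_lists)
    # Document frequency: number of documents containing each term.
    df = Counter(term for terms in term_lists for term in set(terms))
    vectors = []
    for terms in term_lists:
        counts = Counter(terms)
        total = len(terms) or 1  # avoid division by zero for empty documents
        vectors.append({
            t: (c / total) * (math.log((1 + n_docs) / (1 + df[t])) + 1)
            for t, c in counts.items()
        })
    return vectors


# Hypothetical sample emails, for illustration only.
emails = [
    "Verify your account now to avoid suspension",
    "Meeting notes attached for the project review",
]
vectors = tfidf(emails)
```

Because stop words are removed before bigrams are formed, a bigram such as `verify account` can span a dropped word. The resulting sparse vectors would then be passed to a classifier such as Naive Bayes or Logistic Regression.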