Typosquatting is a long-standing cyber threat that exploits typing errors in URLs to deceive users, distribute malware, and conduct phishing attacks. With the proliferation of domain names and new Top-Level Domains (TLDs), typosquatting techniques have grown more sophisticated, posing significant risks to individuals, businesses, and national cybersecurity infrastructure. Traditional detection methods focus primarily on well-known impersonation patterns and therefore miss more complex attacks. This study introduces an approach that uses large language models (LLMs) to improve typosquatting detection. Training an LLM on character-level transformations and pattern-based heuristics, rather than on domain-specific data, yields a more adaptable and resilient detection mechanism. Experimental results indicate that, among the models tested, a properly fine-tuned Phi-4 14B performed best, achieving 98% accuracy with only a few thousand training samples. This research highlights the potential of LLMs in cybersecurity applications, specifically in mitigating domain-based deception tactics, and provides insights into optimizing machine learning strategies for threat detection.
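To make "character-level transformations" concrete, the following minimal Python sketch generates a few common typo classes for a domain label. It is an illustration only: the function name `typo_variants`, the four transformation classes, and the small `HOMOGLYPHS` table are assumptions made for this example, not the transformation set or training data used in the study.

```python
from typing import Dict, Set

# Hypothetical, deliberately tiny look-alike table (an assumption for this
# sketch); real homoglyph tables are far larger.
HOMOGLYPHS: Dict[str, str] = {"o": "0", "l": "1", "i": "1", "e": "3"}


def typo_variants(label: str) -> Dict[str, Set[str]]:
    """Return candidate typosquats of a domain label, grouped by transformation."""
    variants: Dict[str, Set[str]] = {
        "omission": set(),
        "transposition": set(),
        "duplication": set(),
        "homoglyph": set(),
    }
    for i, ch in enumerate(label):
        # Omission: drop the character at position i.
        variants["omission"].add(label[:i] + label[i + 1:])
        # Duplication: type the character at position i twice.
        variants["duplication"].add(label[:i] + ch + label[i:])
        # Transposition: swap adjacent characters when they differ.
        if i + 1 < len(label) and ch != label[i + 1]:
            variants["transposition"].add(
                label[:i] + label[i + 1] + ch + label[i + 2:]
            )
        # Homoglyph: replace the character with a visually similar one.
        if ch in HOMOGLYPHS:
            variants["homoglyph"].add(label[:i] + HOMOGLYPHS[ch] + label[i + 1:])
    # The original label is never a typosquat of itself.
    for group in variants.values():
        group.discard(label)
    return variants
```

Calling `typo_variants("example")` returns a dictionary whose groups contain strings such as `exmple` (omission) and `exmaple` (transposition). Labeled pairs of this kind are one plausible way to build transformation-focused training samples that do not depend on any particular brand's domain list.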