In cybersecurity, allow lists play a crucial role in distinguishing safe websites from potential threats. Conventional methods for compiling allow lists rely heavily on website popularity and therefore often overlook legitimate domains that are infrequently visited. This paper introduces DomainHarvester, a system that generates allow lists that include such trustworthy yet infrequently visited domains. DomainHarvester takes a bottom-up approach that leverages the web's hyperlink structure: it uses seed URLs to gather domain names and then assesses their trustworthiness with a Transformer-based machine learning model. Using DomainHarvester, we built two allow lists, one with a global focus and another emphasizing local relevance. Compared with six existing top lists, DomainHarvester's allow lists show minimal overlap (4% for the global list and 0.1% for the local list) while significantly reducing the risk of including malicious domains. This research draws attention to the overlooked problem of trustworthy yet underrepresented domains and presents a system that goes beyond traditional popularity-based metrics. Our methodology improves the inclusivity and precision of allow lists, which benefits users and businesses worldwide, especially in non-English-speaking regions.
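The bottom-up harvesting step and the overlap comparison can be illustrated with a minimal Python sketch. This is not DomainHarvester's implementation. The names `LinkCollector`, `harvest_domains`, and `overlap_ratio` are hypothetical, the sketch compares full hostnames rather than registrable domains, and it assumes overlap means the fraction of harvested domains that also appear in a top list. The Transformer-based trust assessment is omitted.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag in an HTML document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def harvest_domains(seed_url, html):
    """Return the external hostnames linked from one seed page (hypothetical helper)."""
    parser = LinkCollector()
    parser.feed(html)
    seed_host = urlsplit(seed_url).hostname
    hosts = set()
    for href in parser.hrefs:
        # Resolve relative links against the seed URL before reading the host.
        host = urlsplit(urljoin(seed_url, href)).hostname
        # Skip links without a host (mailto:, javascript:) and same-site links.
        if host and host != seed_host:
            hosts.add(host)
    return hosts


def overlap_ratio(candidates, top_list):
    """Assumed definition: share of candidate domains also present in a top list."""
    if not candidates:
        return 0.0
    return len(candidates & set(top_list)) / len(candidates)
```

A real pipeline would also fetch pages, follow links over several hops, and reduce hostnames to registrable domains with a public suffix list before any comparison.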