Measuring and Modeling the Free Content Web

Free content websites, which offer free books, music, games, movies, and similar content, have existed on the Internet for many years. Although such websites are commonly believed to differ from premium websites that provide the same content types, the literature lacks an analysis supporting this belief. In particular, it is unclear whether free content websites are as safe as their premium counterparts. In this paper, we investigate and quantify the similarities and differences between free content and premium websites, including their risk profiles. To conduct this analysis, we assembled a list of 834 free content websites offering books, games, movies, music, and software, and 728 premium websites offering the same types of content. We then contribute domain-, content-, and risk-level analyses, examining and contrasting the websites' domain names, creation times, SSL certificates, HTTP requests, page sizes, average load times, and content types. For the risk analysis, we examine the maliciousness of these websites at the website and component levels. Among other findings, we show that free content websites tend to be widely distributed across TLDs and exhibit more dynamics, with an upward trend in newly registered domains. Moreover, free content websites are 4.5 times more likely to use an expired certificate, 19 times more likely to be malicious at the website level, and 2.64 times more likely to be malicious at the component level. Encouraged by the clear differences between the two types of websites, we explore automating and generalizing the risk modeling of risky free content websites, showing that a simple machine learning-based technique achieves 86.81% accuracy in identifying them.
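
The "times more likely" figures above can be read as ratios of two proportions: the fraction of free content websites showing a property divided by the fraction of premium websites showing it. The following is a minimal sketch of that arithmetic under this assumed reading. The function name `relative_likelihood` and the hit counts are hypothetical placeholders, not values from the paper; only the group sizes (834 and 728) come from the text.

```python
def relative_likelihood(free_hits, free_total, premium_hits, premium_total):
    """Return (free_hits / free_total) / (premium_hits / premium_total).

    This is an assumed reading of "N times more likely"; the abstract
    does not spell out the exact formula used by the authors.
    """
    if free_total <= 0 or premium_total <= 0:
        raise ValueError("group sizes must be positive")
    premium_rate = premium_hits / premium_total
    if premium_rate == 0:
        raise ValueError("premium rate is zero, so the ratio is undefined")
    return (free_hits / free_total) / premium_rate


# Group sizes as stated in the abstract.
FREE_TOTAL = 834
PREMIUM_TOTAL = 728

# Hypothetical hit counts, chosen only to illustrate the call.
ratio = relative_likelihood(90, FREE_TOTAL, 20, PREMIUM_TOTAL)
print(f"free sites are {ratio:.2f}x as likely to show the property")
```

Because the two groups differ in size, comparing rates rather than raw counts keeps the ratio from being skewed by the larger free content sample.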
