Look back, look around: a systematic analysis of effective predictors for new outlinks in focused Web crawling

From arXiv. Comments: 23 pages, 15 figures, 4 tables; uses arxiv.sty; new title added; heuristic features and their results added; Figures 7, 14, and 15 updated.

Small and medium enterprises rely on detailed Web analytics to stay informed about their market and competition. Focused crawlers meet this demand by crawling and indexing specific parts of the Web. Critically, a focused crawler must quickly find new pages that have not yet been indexed. Since a new page can be discovered only by following a new outlink, predicting new outlinks is highly relevant in practice. In the literature, many feature designs have been proposed for predicting changes in the Web. In this work we provide a structured analysis of this problem, using new outlinks as our running prediction target. Specifically, we unify earlier feature designs in a taxonomy of features along two dimensions: static versus dynamic features, and features of a page versus features of the network around it. Within this taxonomy, complemented by our new (mainly dynamic network) features, we identify the best predictors for new outlinks. Our main conclusion is that the most informative features are the recent history of new outlinks on a page itself and on its content-related pages. Hence, we propose a new 'look back, look around' (LBLA) model that uses only these features. With the obtained predictions, we design a number of scoring functions to guide a focused crawler to the pages with the most new outlinks, and compare their performance. Interestingly, the LBLA approach proved extremely effective, outperforming even the models that use the most complete set of features. One of the learners we use is the recent NGBoost method, which assumes a Poisson distribution for the number of new outlinks on a page and learns its parameters. This connects two so far unrelated avenues in the literature: predictions based on features of a page, and those based on probabilistic modeling. All experiments were carried out on an original dataset made available by a commercial focused crawler.
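
To make the 'look back, look around' idea concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes each page has a list of new-outlink counts from past crawls and a list of content-related pages. All function names, the window size, the fixed blending weight, and the toy data are hypothetical. The paper learns its predictions with models such as NGBoost; here a fixed weighted average stands in for the learner, and the result is treated as a Poisson rate only to illustrate one possible scoring function.

```python
import math

def look_back(history, window=3):
    """'Look back': mean number of new outlinks over the page's last `window` crawls."""
    recent = history[-window:]
    return sum(recent) / len(recent) if recent else 0.0

def look_around(page, histories, related, window=3):
    """'Look around': mean look-back value over the page's content-related pages."""
    neighbours = [p for p in related.get(page, []) if p in histories]
    if not neighbours:
        return 0.0
    return sum(look_back(histories[p], window) for p in neighbours) / len(neighbours)

def poisson_rate(page, histories, related, weight=0.5, window=3):
    """Hypothetical stand-in for a learned model: blend both features into a rate."""
    own = look_back(histories[page], window)
    around = look_around(page, histories, related, window)
    return weight * own + (1.0 - weight) * around

def prob_any_new_outlink(rate):
    """Under a Poisson assumption, P(at least one new outlink) = 1 - exp(-rate)."""
    return 1.0 - math.exp(-rate)

# Toy data (assumed): new-outlink counts per crawl, oldest first.
histories = {
    "a": [0, 2, 4, 6],
    "b": [1, 1, 1, 1],
    "c": [0, 0, 0, 0],
}
# Toy content-relatedness map (assumed).
related = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}

rates = {p: poisson_rate(p, histories, related) for p in histories}
# Crawl order: pages with the highest expected number of new outlinks first.
ranking = sorted(rates, key=rates.get, reverse=True)
print(ranking)
```

Ranking by the expected rate is only one option; ranking by `prob_any_new_outlink(rate)` instead would favor pages that are likely to have at least one new outlink over pages with occasional large bursts.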
