The number of web pages is growing exponentially, and massive amounts of data are accumulating on the web. Webpage classification is one of the key processes in web information mining. Some classical methods build webpage features manually and train classifiers with machine learning or deep learning. However, building features manually requires specific domain knowledge, and validating the features usually takes a long time. Since web pages are generated from a combination of text and HTML Document Object Model (DOM) trees, we propose a representation and classification method based on a pre-trained language model and a graph neural network, named PLM-GNN. It jointly encodes the text and the HTML DOM tree of a web page. It performs well on the KI-04 and SWDE datasets and on AHS, a practical dataset from a scholar homepage crawling project.
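To make the idea of combining text with the DOM tree more concrete, the following is a minimal sketch of how a web page might be turned into a graph whose nodes carry text. It is not the PLM-GNN implementation: the names `DomGraphBuilder` and `html_to_graph` are hypothetical, only the Python standard library is used, and the encoding steps themselves (a pre-trained language model over the node texts, a graph neural network over the edges) are assumed to happen downstream and are not shown.

```python
from html.parser import HTMLParser

# Void elements never have a closing tag, so they are not kept on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class DomGraphBuilder(HTMLParser):
    """Hypothetical helper: collects DOM nodes, their text, and tree edges."""

    def __init__(self):
        super().__init__()
        self.tags = []    # tags[i] is the tag name of node i
        self.texts = []   # texts[i] is the text directly inside node i
        self.edges = []   # (parent_index, child_index) pairs
        self._stack = []  # indices of the currently open elements

    def _add_node(self, tag):
        idx = len(self.tags)
        self.tags.append(tag)
        self.texts.append("")
        if self._stack:
            self.edges.append((self._stack[-1], idx))
        return idx

    def handle_starttag(self, tag, attrs):
        idx = self._add_node(tag)
        if tag not in VOID_TAGS:
            self._stack.append(idx)

    def handle_startendtag(self, tag, attrs):
        # Self-closing syntax such as <br/>: add the node, do not open it.
        self._add_node(tag)

    def handle_endtag(self, tag):
        # Close the most recent matching open element, if there is one.
        for pos in range(len(self._stack) - 1, -1, -1):
            if self.tags[self._stack[pos]] == tag:
                del self._stack[pos:]
                break

    def handle_data(self, data):
        text = data.strip()
        if text and self._stack:
            i = self._stack[-1]
            self.texts[i] = (self.texts[i] + " " + text).strip()


def html_to_graph(html):
    """Return (tags, texts, edges) for an HTML string."""
    builder = DomGraphBuilder()
    builder.feed(html)
    builder.close()
    return builder.tags, builder.texts, builder.edges


page = "<html><body><h1>Jane Doe</h1><p>Professor<br>of CS</p></body></html>"
tags, texts, edges = html_to_graph(page)
```

Under these assumptions, `texts` would be the input to a language model that produces one vector per node, and `edges` would define the neighborhood structure over which a graph neural network aggregates those vectors before classification.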