In this paper, we propose a novel method for extracting information from HTML tables that have similar contents but different structures. We aim to integrate multiple HTML tables into a single table so that information contained in various Web pages can be retrieved. The method extends the tree-structured LSTM, a neural network for tree-structured data, in order to extract both the linguistic and the structural information of HTML data. We evaluate the proposed method through experiments using real data published on the WWW.